Stop Wasting Hours on AI Research Setup—Use AI-Research-SKILLs Instead

What if your coding agent could go from zero to published research paper without you lifting a finger?

Not just code generation. Not just autocomplete. Full autonomous research—literature survey, hypothesis generation, experiment execution, mechanistic interpretability analysis, and LaTeX-formatted paper writing. While you sleep.

Here's the dirty secret most AI researchers won't admit: they spend 70% of their time debugging infrastructure instead of testing hypotheses. Wrestling with DeepSpeed configurations. Chasing down the right vLLM parameters. Figuring out why their QLoRA fine-tune OOMs on an A100. The actual science? That gets whatever scraps of attention remain after the tooling battle is won.

But what if you could package 98 production-ready, expert-level skills into your Claude Code, Codex, or Gemini agent—skills so comprehensive they cover everything from autoresearch orchestration to Flash Attention optimization to NeurIPS-ready paper formatting?

That's exactly what Orchestra-Research/AI-Research-SKILLs delivers. And it's about to change how you think about AI research forever.

What is AI-Research-SKILLs?

AI-Research-SKILLs is the most comprehensive open-source skills library purpose-built to transform any AI coding agent into an autonomous research powerhouse. Maintained by Orchestra Research, this isn't another scattered collection of prompts or half-baked tutorials—it's a systematically engineered research infrastructure spanning 23 categories and 98 meticulously documented skills.

The library's architecture reflects a brutal truth about modern AI research: the gap between idea and execution has become a chasm. You might know you want GRPO rather than PPO for your alignment experiment, but do you know the exact TRL configuration that avoids the reward hacking pattern described in open GitHub issues? Can your agent diagnose why your Megatron-Core pipeline stalls at 47% MFU? Should you use SimPO or DPO for your preference optimization, and does your choice change if you're training on consumer GPUs versus a Lambda Labs H100 cluster?

AI-Research-SKILLs answers all of this. Each skill contains 300KB+ of documentation sourced from official repositories, real GitHub issues with verified solutions, version histories with breaking changes, and production-tested code patterns. The autoresearch skill—the crown jewel—implements a two-loop architecture (inner optimization loop plus outer synthesis loop) that orchestrates the entire research lifecycle, automatically routing to domain-specific skills as needed.
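To make the two-loop idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `ResearchState`, `two_loop`, the `run_experiment` callback, and the 0.5 signal threshold are hypothetical stand-ins, not the autoresearch skill's actual interface. The inner loop repeats experiments on one hypothesis; the outer loop decides whether to record a finding or pivot.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ResearchState:
    """Hypothetical in-memory stand-in for the skill's state files."""
    hypothesis: str
    findings: List[str] = field(default_factory=list)
    pivots: List[Tuple[str, str]] = field(default_factory=list)


def two_loop(
    state: ResearchState,
    candidates: List[str],
    run_experiment: Callable[[str], float],
    inner_steps: int = 3,
    signal_threshold: float = 0.5,
) -> ResearchState:
    """Outer loop picks a direction; inner loop iterates experiments on it."""
    queue = list(candidates)
    while True:
        # Inner loop: repeat experiments on the current hypothesis, keep the strongest signal.
        best = max(abs(run_experiment(state.hypothesis)) for _ in range(inner_steps))
        # Outer loop: synthesize the result, then record a finding or pivot.
        if best >= signal_threshold:
            state.findings.append(f"{state.hypothesis}: |r|={best:.2f}")
            return state
        if not queue:
            return state  # No directions left; stop without a finding.
        nxt = queue.pop(0)
        state.pivots.append((state.hypothesis, nxt))
        state.hypothesis = nxt


# Toy experiment: fixed correlations per hypothesis (values mirror the article's example).
scores = {"ETF overlap": 0.02, "norm heterogeneity": -0.99}
final = two_loop(ResearchState("ETF overlap"), ["norm heterogeneity"], scores.__getitem__)
print(final.hypothesis, final.findings, final.pivots)
```

A real run would replace the lookup table with actual training and evaluation jobs; the point is only the control flow of iterate, synthesize, pivot.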

The project exploded onto the scene in late 2025 and has been accelerating ever since. From 5 initial fine-tuning skills in November 2025 to 98 skills across 23 categories by April 2026, Orchestra Research has shipped at a pace that makes most open-source projects look stagnant. Two peer-reviewed-quality papers have already been produced entirely by autonomous agents using these skills—one discovering that norm heterogeneity predicts LoRA fine-tuning difficulty with r=-0.99, another revealing that DPO is fundamentally a rank-1 perturbation while online RL preserves distributed structure.

Key Features That Separate It From Everything Else

Research-Grade Documentation Density: Each skill packs 200-600 lines of focused guidance with progressive disclosure. The GRPO-RL-Training skill alone contains 569 lines plus references—a gold standard that walks through group relative policy optimization with the exact TRL patterns that avoid common failure modes.

Autoresearch Orchestration Layer: More than a single skill, autoresearch is an orchestration layer. It manages literature survey, ideation, experiment design, execution, and paper writing through structured state files (research-state.yaml, findings.md, research-log.md). It supports continuous operation via Claude Code's /loop command or OpenClaw heartbeat cron jobs, so your agent keeps researching while you're offline.

Agent-Native Research Artifacts (ARA): The April 2026 v1.6.0 release introduced three revolutionary skills that turn research outputs into falsifiable, auditable, agent-traversable artifacts. The ARA Compiler structures inputs into claims, exploration graphs, evidence, and code stubs. The Research Manager extracts decisions and dead ends from conversation history with provenance tags. The Rigor Reviewer performs semantic epistemic review across six dimensions—essentially automated peer review.

One-Command Installation: The npm package @orchestra-research/ai-research-skills auto-detects your coding agent (Claude Code, Hermes Agent, OpenCode, Cursor, Gemini CLI, Codex, Qwen Code) and installs with symlinks or copies. Update, uninstall, or selectively install by category—all through a clean interactive CLI.

Real GitHub Issues, Real Solutions: Unlike documentation that pretends software has no bugs, these skills include actual GitHub issues with verified workarounds. Your agent encounters the DeepSpeed ZeRO-3 hang? The skill already knows the overlap_comm=False fix from issue #3472.

Production-Ready Code Patterns: Every skill includes executable patterns, not pseudocode. The vLLM skill shows exact PagedAttention configuration for throughput versus latency tradeoffs. The Megatron-Core skill details the 4D parallelism strategy that achieves 47% MFU on H100 clusters.

Use Cases Where AI-Research-SKILLs Destroys the Competition

Autonomous Literature-to-Paper Research

A graduate student needs to explore whether transformer normalization properties affect parameter-efficient fine-tuning. Instead of weeks of manual experimentation, they bootstrap the autoresearch skill with a seed hypothesis. The agent surveys literature, pivots when initial results are null, discovers the norm heterogeneity correlation, runs validation with Axolotl and PEFT skills, generates publication-quality plots, and outputs a formatted NeurIPS submission—in 48 hours of continuous operation.

Multi-Skill Mechanistic Interpretability

A research engineer wants to understand why DPO behaves differently from online RL. The agent loads the TRL and GRPO skills for training, then automatically invokes TransformerLens and SAELens skills for activation analysis, nnsight for remote 70B+ model experiments, and finally the ML Paper Writing skill to synthesize findings. The result: the discovery that "DPO is rank-1 alignment" with quantitative evidence from SVD recovery analysis.
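What would "rank-1" mean quantitatively? One plausible measure, sketched below, is the share of a weight update's squared Frobenius norm captured by its top singular value. This is an illustrative assumption, not the paper's analysis code: `rank1_energy` and `top_singular_value` are hypothetical helpers, and the power iteration assumes the starting vector is not orthogonal to the top singular direction.

```python
import math
from typing import List

Matrix = List[List[float]]


def top_singular_value(a: Matrix, iters: int = 200) -> float:
    """Estimate the largest singular value by power iteration on A^T A."""
    rows, cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # A v
        w = [sum(a[i][j] * av[i] for i in range(rows)) for j in range(cols)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in av))


def rank1_energy(delta: Matrix) -> float:
    """Fraction of squared Frobenius norm captured by the top singular direction."""
    total = sum(x * x for row in delta for x in row)
    return top_singular_value(delta) ** 2 / total if total else 0.0


# An outer product is exactly rank 1; the identity spreads energy evenly.
rank1 = [[u * v for v in (1.0, 2.0, 3.0)] for u in (0.5, -1.0)]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(rank1_energy(rank1), rank1_energy(identity))
```

A value near 1.0 means the update is essentially a single direction; a value near 1/n for an n-by-n matrix means the change is spread across many directions.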

Infrastructure-Agnostic Training at Scale

An ML engineer needs to train a 70B parameter model but doesn't know whether to use DeepSpeed ZeRO, PyTorch FSDP2, or Megatron-Core. The distributed training skills provide decision frameworks based on cluster size, network topology, and model architecture. The SkyPilot and Lambda Labs skills handle multi-cloud orchestration and GPU procurement. The agent automatically selects optimal configurations and handles spot instance recovery.

Safety-Critical Deployment Pipeline

A startup needs to deploy a customer-facing LLM with guardrails against prompt injection and harmful outputs. The agent sequentially loads Prompt Guard (99%+ TPR, <2ms GPU latency), LlamaGuard for input/output classification, NeMo Guardrails for programmable conversation constraints, and Constitutional AI for self-improvement principles—all coordinated through the safety skills category.

Step-by-Step Installation & Setup Guide

Getting started with AI-Research-SKILLs is deliberately frictionless. Orchestra Research optimized for agent-native installation—meaning your AI coding agent can install itself.

Method 1: Interactive Installer (Recommended for Humans)

# One command installs all 98 skills with interactive prompts
npx @orchestra-research/ai-research-skills

This launches the interactive installer which:

Auto-detects installed coding agents (Claude Code, Cursor, Gemini CLI, etc.)
Installs skills to ~/.orchestra/skills/ with symlinks (or copies on Windows)
Offers installation modes: complete bundle, quickstart, by category, or individual skills
Configures the autoresearch orchestration layer automatically

Method 2: Agent-Native Bootstrap (Recommended for AI Agents)

Point your agent to the welcome document and let it handle everything:

Read https://www.orchestra-research.com/ai-research-skills/welcome.md and follow the instructions to install and use AI Research Skills.

This is the killer feature for autonomous operation. Your agent reads the welcome doc, understands the installation protocol, executes it, and begins research without human intervention.

Method 3: Claude Code Marketplace (Category-Selective)

# Add the Orchestra Research marketplace
/plugin marketplace add orchestra-research/AI-research-SKILLs

# Install specific categories as needed
/plugin install fine-tuning@ai-research-skills        # Axolotl, LLaMA-Factory, PEFT, Unsloth
/plugin install post-training@ai-research-skills      # TRL, GRPO, OpenRLHF, SimPO, verl
/plugin install inference-serving@ai-research-skills  # vLLM, TensorRT-LLM, llama.cpp, SGLang
/plugin install distributed-training@ai-research-skills  # DeepSpeed, FSDP, Megatron-Core

Post-Installation Verification

# List all installed skills with versions
npx @orchestra-research/ai-research-skills list

# Update to latest versions
npx @orchestra-research/ai-research-skills update

# Verify autoresearch skill is loaded
claude /skill autoresearch status

Environment Configuration

For continuous autonomous operation, configure your agent with:

# Claude Code: enable loop mode for persistent research
export CLAUDE_CODE_LOOP_ENABLED=true
export CLAUDE_CODE_HEARTBEAT_INTERVAL=300  # 5-minute checkpoints

# Set research workspace
export ORCHESTRA_RESEARCH_DIR=~/ai-research-projects
mkdir -p "$ORCHESTRA_RESEARCH_DIR"/{literature,experiments,src,data,to_human}

REAL Code Examples from the Repository

Let's examine actual patterns from the AI-Research-SKILLs repository that demonstrate its depth and practical utility.

Example 1: Autoresearch Orchestration Bootstrap

The autoresearch skill's core architecture enables continuous research through structured state management. Here's how the two-loop system initializes:

# research-state.yaml — persistent project memory across sessions
project:
  title: "Norm Heterogeneity and LoRA Brittleness"
  status: active  # active | paused | completed | abandoned
  
# Inner loop: rapid experiment iteration
optimization_loop:
  current_hypothesis: "Layer norm statistics predict fine-tuning difficulty"
  experiment_count: 23
  last_result: "r=-0.99 correlation confirmed; ETF overlap hypothesis rejected (no signal)"
  
# Outer loop: synthesis and direction pivoting
synthesis_loop:
  key_finding: "Pre-norm variance > 0.3 strongly predicts LoRA convergence failure"
  pivot_history:
    - {from: "ETF overlap hypothesis", to: "norm heterogeneity", reason: "r=0.02, no signal"}
  
# Automatic skill routing — agent doesn't need to know which skill to call
skill_routing:
  literature_survey: 15-rag/sentence-transformers  # semantic paper search
  experiments: 03-fine-tuning/peft                  # LoRA implementation
  analysis: 04-mechanistic-interpretability/transformer-lens  # activation analysis
  writing: 20-ml-paper-writing                     # LaTeX output

This YAML structure is not documentation; it's machine-readable state. The autoresearch skill parses this file, determines the current research phase, and automatically invokes the appropriate domain skills. The pivot_history array is crucial: it records failed hypotheses so the agent can learn from them, which is how the norm heterogeneity finding emerged after the initial ETF overlap theory proved incorrect.
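How might that routing look in practice? Here is a rough sketch, assuming the YAML has already been parsed into a dictionary. The routing table copies the skill paths from the example above, but the phase rule and the `literature_done` and `analysis_done` flags are hypothetical; the skill's real phase logic is not shown in this article.

```python
from typing import Dict

# Mirrors the skill_routing block from the YAML above, as it might look once parsed.
SKILL_ROUTING: Dict[str, str] = {
    "literature_survey": "15-rag/sentence-transformers",
    "experiments": "03-fine-tuning/peft",
    "analysis": "04-mechanistic-interpretability/transformer-lens",
    "writing": "20-ml-paper-writing",
}


def next_phase(state: dict) -> str:
    """Hypothetical rule: survey first, experiment until a finding exists, then analyze and write."""
    if not state.get("literature_done", False):
        return "literature_survey"
    if not state.get("key_finding"):
        return "experiments"
    if not state.get("analysis_done", False):
        return "analysis"
    return "writing"


def route(state: dict) -> str:
    """Return the skill path for the current research phase."""
    return SKILL_ROUTING[next_phase(state)]


state = {"literature_done": True, "key_finding": None}
print(route(state))  # prints 03-fine-tuning/peft
```

Keeping the phase rule separate from the routing table means either can change without touching the other, which is roughly what a state-file-driven design buys you.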

Example 2: Production GRPO Training with TRL

The GRPO-RL-Training skill—marked as gold standard with 569 lines—contains this exact pattern for group relative policy optimization:

# From 06-post-training/grpo-rl-training/SKILL.md
from trl import GRPOConfig, GRPOTrainer
from peft import LoraConfig
import torch

# GRPO-specific: no separate value (critic) model is needed (unlike PPO), which cuts training memory
grpo_config = GRPOConfig(
    output_dir="./grpo_output",
    num_generations=8,           # G in GRPO: group size for relative rewards
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-6,
    logging_steps=10,
    # Critical: beta weights the KL penalty that keeps the policy close to the reference model
    # Too high = no exploration; too low = reward hacking
    beta=0.04,                   # Tuned for mathematical reasoning tasks
)

# LoRA for memory-efficient policy updates
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    reward_funcs=[math_accuracy_reward, format_reward],  # Your reward functions
    args=grpo_config,
    train_dataset=dataset,
    peft_config=peft_config,  # Trains LoRA adapters instead of full weights
)

# The skill includes this critical warning from real GitHub issues:
# "GRPO with beta < 0.02 causes mode collapse on reasoning tasks.
#  Monitor group reward variance; if it drops below 0.1, increase beta."
trainer.train()

Notice the inline operational wisdom—not just API documentation, but the empirical finding about beta thresholds that prevents days of debugging. This pattern is validated against the actual TRL repository's issue tracker.
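The "group relative" part of GRPO, and the variance check mentioned in the warning, can be sketched in a few lines. This is a simplified illustration, not TRL's implementation: it uses the population standard deviation for simplicity, the helper names are hypothetical, and the 0.1 threshold is taken from the warning quoted above.

```python
from statistics import mean, pstdev, pvariance
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize each completion's reward against its own group (same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def low_variance(rewards: List[float], threshold: float = 0.1) -> bool:
    """Flag a group whose reward variance fell below the threshold."""
    return pvariance(rewards) < threshold


# Eight completions for one prompt (num_generations=8), scored 0 or 1.
group = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
advs = group_relative_advantages(group)
print(advs[:2], low_variance(group))

# A collapsed group: every completion gets the same reward, so no learning signal remains.
collapsed = [1.0] * 8
print(group_relative_advantages(collapsed), low_variance(collapsed))
```

When every completion in a group earns the same reward, all advantages are zero and the policy gradient vanishes, which is why watching group reward variance is a useful early warning.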

Example 3: vLLM Production Inference Configuration

The inference-serving skills demonstrate production-hardened patterns. Here's the vLLM skill's approach to throughput-optimized serving:

# From 12-inference-serving/vllm/SKILL.md
from vllm import LLM, SamplingParams
import os

# PagedAttention is the key innovation: it eliminates most KV cache memory waste
# Contiguous KV cache: reserves batch_size * max_seq_len token slots up front
# PagedAttention: allocates fixed-size blocks on demand, roughly batch_size * actual_seq_len

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,           # Split across 4 GPUs
    pipeline_parallel_size=1,         # Single pipeline stage
    gpu_memory_utilization=0.90,      # Leave 10% for CUDA overhead
    
    # Enable chunked prefill for interleaved prefill/decode
    # Critical for throughput: avoids head-of-line blocking
    enable_chunked_prefill=True,
    max_num_batched_tokens=4096,      # Chunk size for prefill
    
    # Prefix caching for multi-turn conversations / RAG
    enable_prefix_caching=True,       # Cache shared system prompts
    
    # Quantization for single-node 70B deployment
    quantization="awq",               # 4-bit, minimal accuracy loss per skill docs; needs an AWQ-quantized checkpoint
)

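# Assumes a vLLM release that exposes GuidedDecodingParams; newer releases rename this API
from vllm.sampling_params import GuidedDecodingParams

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=2048,
    # Structured output via guided decoding; response_schema is your JSON schema dict
    guided_decoding=GuidedDecodingParams(json=response_schema),
)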

# The skill notes: "With AWQ + tensor_parallel=4 on 4xA100-80GB,
# achieves 2,847 tok/s throughput vs. 312 tok/s for HF transformers."
outputs = llm.generate(prompts, sampling_params)

The quantified performance claim (2,847 vs. 312 tok/s) is sourced from benchmark runs documented in the skill's references, not marketing copy. This matters when your agent makes deployment decisions autonomously.
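Why does block-based allocation matter so much? A back-of-the-envelope sketch helps. It counts reserved token slots only, and assumes a block size of 16 tokens; the function names and the sample sequence lengths are hypothetical, and real memory use also depends on layers, heads, and dtype.

```python
import math
from typing import List


def preallocated_slots(seq_lens: List[int], max_seq_len: int) -> int:
    """Token slots reserved when every sequence gets a full max_seq_len buffer."""
    return len(seq_lens) * max_seq_len


def paged_slots(seq_lens: List[int], block_size: int = 16) -> int:
    """Token slots reserved when each sequence holds only the blocks it has filled."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


lens = [100, 900, 30, 2048]            # actual lengths of four in-flight sequences
print(preallocated_slots(lens, 2048))  # prints 8192
print(paged_slots(lens))               # prints 3104
```

With paging, waste is bounded by less than one block per sequence, so the freed memory can hold more concurrent requests.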

Example 4: ARA Compiler for Research Reproducibility

The Agent-Native Research Artifact skills represent bleeding-edge research infrastructure. The ARA Compiler transforms chaotic research inputs into structured knowledge:

# From 22-agent-native-research-artifact/compiler/SKILL.md
from ara_compiler import ARACompiler, InputType

# Compile any research input into traversable artifact
compiler = ARACompiler()

artifact = compiler.compile(
    inputs=[
        {"path": "paper.pdf", "type": InputType.PDF},           # Literature
        {"path": "https://github.com/user/repo", "type": InputType.REPO},  # Code
        {"path": "experiments/", "type": InputType.LOGS},       # Results
        {"path": "notes.md", "type": InputType.RAW_NOTES},      # Thoughts
    ],
    output_dir="./ara/",
)

# Generated structure:
# ara/
# ├── cognitive_layer/
# │   ├── claims.jsonl          # Falsifiable statements with confidence
# │   ├── concepts.yaml         # Defined terms and relationships
# │   └── heuristics.md         # Rules of thumb discovered
# ├── physical_layer/
# │   ├── configs/              # Reproducible experiment configs
# │   └── code_stubs/           # Minimal working implementations
# └── exploration_graph/
#     ├── dag.json              # Research decisions as directed graph
#     └── evidence/             # Supporting/rejecting evidence per claim

# The Rigor Reviewer then scores this artifact
from ara_rigor_reviewer import RigorReviewer

reviewer = RigorReviewer()
report = reviewer.review(artifact)
# Scores: evidence_relevance, falsifiability, scope_calibration, 
#         argument_coherence, exploration_integrity, methodological_rigor
# Output: Strong Accept / Weak Accept / Weak Reject / Strong Reject

This isn't theoretical—it's the infrastructure that enables verifiable, auditable autonomous research. When your agent produces a finding, the ARA system traces exactly what evidence supports it, what alternatives were explored, and where the reasoning might fail.

Advanced Usage & Best Practices

Orchestrate Multiple Skills for Compound Research: The most powerful pattern isn't using one skill—it's the autoresearch skill chaining them automatically. But you can manually compose: start with Research Brainstorming for ideation, route to Mechanistic Interpretability skills for analysis, use Emerging Techniques skills for novel architectures, and finish with ML Paper Writing for publication.

Leverage Continuous Operation: The /loop command in Claude Code combined with autoresearch's heartbeat mechanism enables true 24/7 research. Configure checkpoint intervals (every 5 minutes) and set to_human/ directory thresholds for when the agent should pause for human review versus continue autonomously.

Version Pin Skills for Reproducibility: While npx @orchestra-research/ai-research-skills update pulls the latest skills, research reproducibility demands stability. Pin specific versions in your research-state.yaml:

skill_versions:
  grpo-rl-training: "1.6.0"      # Verified for this project
  transformer-lens: "1.6.0"
  vllm: "1.6.0"
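A pin is only useful if something checks it. Below is a minimal sketch of such a check, assuming you can obtain the installed versions as a dictionary; how to read them from the installer is not covered here, so `installed` and its "1.7.0" entry are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def pin_mismatches(pinned: Dict[str, str], installed: Dict[str, str]) -> List[str]:
    """Describe every pinned skill whose installed version is missing or different."""
    problems = []
    for skill, want in sorted(pinned.items()):
        have = installed.get(skill)
        if have is None:
            problems.append(f"{skill}: pinned {want}, not installed")
        elif have != want:
            problems.append(f"{skill}: pinned {want}, installed {have}")
    return problems


pinned = {"grpo-rl-training": "1.6.0", "transformer-lens": "1.6.0", "vllm": "1.6.0"}
installed = {"grpo-rl-training": "1.6.0", "vllm": "1.7.0"}  # hypothetical installed set
for line in pin_mismatches(pinned, installed):
    print(line)
```

Running a check like this at the start of each session would surface drift before it contaminates a long experiment series.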

Use ARA Provenance for Collaborative Research: The user / ai-suggested / ai-executed / user-revised provenance tags in ARA artifacts enable human-AI collaborative papers where contribution boundaries are transparent and auditable.
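To show how provenance tags could travel with each claim, here is a small sketch. The four tag values come from the text above; the `Claim` fields, the confidence numbers, and the JSONL layout are assumptions for illustration and are not the ARA schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

PROVENANCE_TAGS = ("user", "ai-suggested", "ai-executed", "user-revised")


@dataclass
class Claim:
    """Hypothetical claim record; field names are illustrative, not the ARA schema."""
    text: str
    provenance: str
    confidence: float

    def __post_init__(self) -> None:
        if self.provenance not in PROVENANCE_TAGS:
            raise ValueError(f"unknown provenance tag: {self.provenance!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


def to_jsonl(claims: List[Claim]) -> str:
    """Serialize claims as one JSON object per line, as in a .jsonl file."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in claims)


# Illustrative claims with made-up confidence values.
claims = [
    Claim("Norm heterogeneity predicts LoRA difficulty", "ai-executed", 0.9),
    Claim("ETF overlap predicts LoRA difficulty", "ai-suggested", 0.1),
]
print(to_jsonl(claims))
```

Rejecting unknown tags at construction time keeps the contribution record auditable, since every stored claim is guaranteed to carry one of the four recognized origins.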

Monitor Skill Routing Decisions: The autoresearch skill logs all routing decisions to research-log.md. Review these to understand why your agent selected particular tools—this is how you debug agent cognition, not just code.

Comparison with Alternatives

Dimension	AI-Research-SKILLs	Generic Prompt Libraries	Framework Documentation	AutoML Tools
Scope	98 skills, full research lifecycle	Scattered prompts, no orchestration	Single framework only	Limited to HPO/NAS
Agent Integration	Native: Claude Code, Codex, Gemini, etc.	Manual copy-paste	None	API-only
Real Issues & Fixes	✅ GitHub issues with solutions	❌ Rarely	❌ Official docs only	N/A
Autonomous Orchestration	✅ Two-loop autoresearch architecture	❌	❌	❌
Research Artifacts	✅ ARA: falsifiable, auditable	❌	❌	❌
Installation	One command, auto-detects agent	Manual	Package manager	Complex setup
Paper Writing	✅ LaTeX templates, citation verification	❌	❌	❌
Continuous Operation	✅ /loop + heartbeat	❌	❌	❌
Community Skills	SkillEvolve meta-skill for collective learning	❌	❌	❌

The fundamental difference: AI-Research-SKILLs is research infrastructure, not a tool collection. Generic prompt libraries give you isolated capabilities. Framework documentation gives you reference material. AutoML tools give you optimization. Only AI-Research-SKILLs provides the cognitive architecture for end-to-end autonomous research with verifiable outputs.

FAQ

Q: Do I need to be an expert in all 23 categories to use this? A: Absolutely not. That's the entire point. The autoresearch skill routes to domain skills automatically—you define the research question, the agent handles specialization. You should understand enough to verify outputs, but the skills contain the expertise.

Q: Which coding agents are supported? A: Claude Code, Hermes Agent, OpenCode, OpenClaw, Cursor, OpenAI Codex, Gemini CLI, and Qwen Code. The installer auto-detects available agents and configures symlinks appropriately.

Q: Can this really produce publication-quality research? A: Two papers have been produced entirely by autonomous agents using these skills, with findings strong enough for top-tier venues. The Norm Heterogeneity paper demonstrated autonomous pivoting from null results. The RL Brain Scan paper combined training, interpretability, and synthesis skills. Human review and refinement are still recommended, but the autonomous output is research-grade.

Q: How does this differ from just using Claude Code or Codex normally? A: Raw coding agents have general programming knowledge but lack deep, structured research expertise. AI-Research-SKILLs packages 130,000+ lines of specialized documentation into your agent's context, enabling it to make informed decisions about GRPO hyperparameters, Megatron parallelism strategies, or mechanistic interpretability techniques—domains where generic agents hallucinate or default to outdated patterns.

Q: Is my research data sent to Orchestra Research? A: No. Skills install locally to ~/.orchestra/skills/. The ARA system writes to your local filesystem. The only network calls are to the npm registry for installation and optional Slack community joining.

Q: What's the licensing? A: MIT License for the skills library. Individual referenced frameworks (PyTorch, TRL, vLLM, etc.) maintain their own licenses—always verify before commercial deployment.

Q: How do I contribute new skills? A: See CONTRIBUTING.md. The standardized structure requires SKILL.md with metadata, quick patterns, and references directory with official docs, issues, and tutorials.

Conclusion

AI-Research-SKILLs isn't just a productivity tool—it's a fundamental shift in how AI research gets done.

For decades, research progress was bottlenecked by human bandwidth: one researcher can only master so many frameworks, run so many experiments, read so many papers. By packaging 98 expert-level skills into an orchestrated, autonomous system, Orchestra Research has created something unprecedented: an AI research agent with genuine breadth and depth.

The evidence speaks for itself. Papers written by agents. Hypotheses pivoted autonomously when data doesn't fit. 130,000 lines of documentation distilled into actionable, verifiable research infrastructure. And it's all open-source, MIT-licensed, installable in one command.

But here's what excites me most: this is still early. The SkillEvolve meta-skill enables collective intelligence—techniques discovered by one researcher's agent get shared back as curated skills for everyone. The ARA system creates falsifiable, auditable research in an era of reproducibility crises. The two-loop autoresearch architecture will only improve as more skills are added and more agents run continuously.

If you're doing AI research in 2026 and not using AI-Research-SKILLs, you're choosing to fight with infrastructure instead of ideas. You're choosing manual tool configuration over autonomous experimentation. You're choosing slower science.

Don't do that.

Install it now. Point your agent at a research question. Let it run overnight. Wake up to findings, plots, and a structured artifact ready for human refinement.

👉 Get AI-Research-SKILLs on GitHub — One command to transform your coding agent into an autonomous researcher.

The future of AI research isn't human-only. It's not AI-only. It's human-AI orchestration—and AI-Research-SKILLs is the conductor's baton.