SkillOpt: a Mechanism That Makes Skills Training Stable

The Skill Document Sitting in Your Agent's System Prompt Is Not Being Trained

Somewhere in your enterprise AI stack, a frozen frontier model is running tasks against a skill document that a prompt engineer wrote six weeks ago, or that a one-shot LLM call generated during setup. When the agent fails on a specific document type, or misbehaves inside a Codex or Claude Code loop, the fix is manual: someone edits the prompt, redeploys, and hopes. There is no validation step. There is no memory of what was tried and rejected. There is no concept of a learning rate. The skill just sits there, static, while the model it guides keeps encountering the same failure patterns.

This is the default state of skill management for any organization running closed frontier models. Weight adaptation is unavailable for GPT-5.x and expensive even for open models. The skill document is the only adaptation layer you have. And until now, nobody treated it as something that could actually be optimized.

Researchers from Microsoft, Shanghai Jiao Tong University, Tongji University, and Fudan University published SkillOpt in May 2026, addressing this gap directly. The system introduces what it describes as the first controllable text-space optimizer for agent skills: a separate optimizer model converts scored task rollouts into bounded add, delete, and replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. On GPT-5.5 in direct chat, SkillOpt lifts average no-skill accuracy by 23.5 points. Inside Codex, the lift is 24.8 points. Inside Claude Code, 19.1 points. Across all 52 evaluated combinations of model, benchmark, and execution harness, SkillOpt is best or tied-best in every single cell against every competitor tested. The central claim is not that SkillOpt writes better prompts than humans. It is that skill editing, when treated with the same discipline applied to neural network weights, produces artifacts that consistently outperform everything else, including domain expert hand-crafted skills, and transfer across models and harnesses without retraining.

Why Every Prior Approach Has a Validation-Shaped Hole in It

The failure modes of existing approaches are not random. They share a structural problem: none of them closes the loop between generating an edit and verifying whether that edit actually improves the target model's performance on held-out tasks.

Human-crafted skills are brittle under specific target domains and harnesses. They are beaten by SkillOpt in every direct-chat model row in the study, despite being 145 to 516 tokens of carefully written domain expertise. One-shot LLM skills are generated once and never updated. Trace2Skill distills trajectory lessons but has no held-out validation gate, so plausible-sounding diagnoses that actually hurt performance are never filtered out. TextGrad optimizes prompts but produces outright negative results in several configurations: minus 0.7 on SpreadsheetBench for GPT-5.5 and minus 15.3 on SpreadsheetBench for Qwen3.6-35B. GEPA evolves prompts via reflective feedback but targets system designs rather than portable, reusable skill artifacts. EvoSkill lacks both the bounded edit budget and the rejected-edit memory that prevent a single bad epoch from overwriting accumulated progress.

The common thread is what the paper calls the problem with unbounded rewrites: they can erase useful rules, introduce incompatible instructions, or overfit to a local failure. Without a held-out gate, a validation-rejected edit looks identical to a validated one. The skill degrades, and nobody knows why.

The Mechanisms That Make Skill Training Stable

SkillOpt's architecture maps deliberately onto the controls that make neural network training reproducible. The analogy is not decorative: each component solves a specific instability that appears when you remove it.

SkillOpt Mechanism	What It Controls	What Breaks Without It
Textual Learning Rate (edit budget L_t)	Maximum edits accepted per step, with cosine decay	Ablation: SearchQA/SpreadsheetBench/LiveMath drop from 87.1/77.5/61.3 to 84.6/75.7/57.3
Validation Gate	Every candidate skill tested on held-out split before acceptance	Baselines without gates (Trace2Skill, TextGrad) produce negative results in multiple configurations
Epoch-Wise Slow/Meta Update	Cross-epoch longitudinal consolidation into protected skill region	SpreadsheetBench drops 22.5 points when this component is removed — the largest single degradation in the ablation suite

The slow/meta update is the component most likely to be underestimated, and the one with the largest empirical impact. At the end of each epoch, the optimizer compares the same training tasks under the previous epoch's skill and the current one, groups outcomes into improvements, regressions, persistent failures, and stable successes, and writes a compact longitudinal guidance block into a markup-fenced protected region of the skill document. Step-level edits cannot touch this region. Only the epoch-boundary process can rewrite it. The result is a separation between fast intra-epoch updates and slower cross-epoch consolidation of durable procedural lessons.

The rejected-edit buffer works alongside this. When a candidate skill fails the validation gate, the rejected edits and their associated failure patterns are stored in an epoch-local buffer that gets prepended to all subsequent optimizer calls within that epoch. The optimizer learns not to repeat failed edits within the same training run. The buffer adds zero inference-time cost at deployment because it is discarded entirely before the skill is exported.

What the Optimization Loop Actually Does, Step by Step

The full pipeline runs entirely offline. The deployed artifact is a single markdown file. Here is what the training loop does:

The frozen target model runs a batch of tasks from the training split with the current skill injected into its context. Scored trajectories are collected, including task metadata, tool calls, observations, and verifier feedback.
Trajectories are partitioned into failure and success groups, then into minibatches. The optimizer model analyzes failure minibatches to propose missing or corrective rules. It analyzes success minibatches to propose rules that preserve working behaviors.
Failure-prioritized merge combines proposals. The optimizer ranks them and clips to the top L_t edits (default 4, cosine decay to floor 2). This is the textual learning rate in operation.
The selected edits are applied to produce a candidate skill. The candidate is evaluated on the held-out selection split using the frozen target model.
If the candidate strictly improves the selection score (ties are rejected), it is accepted. If it is also the best candidate seen so far, it updates best_skill.md. If rejected, the edits enter the rejected-edit buffer.
At epoch end, the slow update compares the previous and current epoch-end skills across a sampled set of training tasks, writes longitudinal guidance into the protected region, and validates this update against the selection split before committing it.
The optimizer also maintains a meta skill document that records which edit patterns helped, which were rejected, and which failures persisted. This meta skill is never deployed. It guides the optimizer's future prompts only.

At deployment, the target model runs with only best_skill.md prepended to its system or developer instructions. The optimizer model is not present at inference. The mechanism adds zero latency.

The Numbers Across Six Benchmarks and What They Mean for Closed-Model Deployments

The evaluation spans 52 model-benchmark-harness combinations across GPT-5.5, GPT-5.4, GPT-5.4-mini, GPT-5.4-nano, GPT-5.2, and Qwen3.5 variants. SkillOpt is best or tied-best on all 52 cells. Some specific findings are worth isolating.

On SpreadsheetBench with GPT-5.5 in direct chat, SkillOpt reaches 80.7 against a no-skill baseline of 41.8. That is a 38.9-point lift. Human-crafted skills reach 72.9. The closest automated competitor, GEPA, reaches 73.6. TextGrad produces 41.1 — below the no-skill baseline. The same benchmark with GPT-5.5 inside Codex produces a 24.8-point average lift. The gap between SkillOpt and the next-best competitor on SpreadsheetBench is not attributable to a single mechanism: it is where the slow/meta update has the most visible effect, and its ablation shows exactly what happens without it.

On LiveMath with GPT-5.5 in direct chat, SkillOpt reaches 66.9 against a no-skill baseline of 37.6, a 29.3-point lift. Human skills reach 38.4. One-shot LLM skills reach 40.0. This is a domain where human expertise translates poorly into written procedural rules but the optimization loop can identify what the model is consistently getting wrong and encode corrections.

On GPT-5.4-nano, a much weaker model, DocVQA accuracy rises from 30.8 to 80.2, a 49.4-point lift. Human skills reach 73.5. The pattern holds: a validated, iteratively optimized skill document extracts more performance from a small model than a domain expert can write by hand.

Transfer experiments add a strategically important finding. A skill optimized for GPT-5.5 retains value when moved to GPT-5.4 without retraining. A skill optimized for Codex transfers to Claude Code. A skill optimized for SpreadsheetBench transfers to a nearby math benchmark. The paper qualifies this: transfer experiments cover a limited set of model pairs and benchmark adjacencies, so generalization to radically different domains or model families is not yet demonstrated. But the implication for organizations managing multiple models or harnesses is direct: a single optimization run can produce an artifact with durable value across several deployment contexts.

What a Practical Deployment Path Looks Like

SkillOpt's code is publicly available and the system is harness-agnostic. A sequenced deployment makes sense given that the optimization loop requires defining a benchmark, a train-selection-test split, and a scoring function specific to the target domain.

Phase 1: Define the measurement surface. Pick one domain where your frozen model's current skill or system prompt is measurably underperforming. Build a task set with a verifiable scoring function, at least enough tasks to split into train, selection, and test. This is the constraint that makes the validation gate meaningful. If you cannot define a score, the optimization loop has nothing to gate on.

Phase 2: Run SkillOpt against a single model and benchmark. Use the default configuration: cosine learning rate schedule, L_t starting at 4, minibatch size of 4, validation gate requiring strict improvement. Treat the resulting best_skill.md as your test artifact. Measure test-set performance against your prior skill and against the no-skill baseline. If the test-set result does not improve, the scoring function or task split needs review before proceeding.

Phase 3: Test transfer before committing to additional optimization runs. If you operate multiple models or harnesses, inject the Phase 2 skill into those contexts before running separate optimization cycles. The transfer results in the paper suggest this is often sufficient. Save optimization compute for domains where transfer is insufficient.

The artifact you are left with is a 300 to 2,000-token markdown file. It is human-readable, auditable, and deployable by prepending to any system instruction. The optimization mechanism that produced it is entirely offline. What SkillOpt changes is not the inference architecture. It changes how you decide what goes into the system prompt, replacing manual iteration with a process that has a learning rate, a validation gate, and memory of what it already tried.

The question worth sitting with is not whether your current skill documents are good. It is whether the process that produced them has any of those properties.

Agents Applied covers AI research that changes how organizations build and deploy intelligent systems. Published weekly for senior technology leaders.