reinforcement-learning · skill-library · self-improving-agents · tool-use · sequential-rollout · GRPO

The AI Agent That Learns From Its Own Work

SAGE, a reinforcement learning framework from AWS and UW-Madison, trains AI agents to write and reuse programmatic skills across chains of related tasks, achieving scenario completion 8.9 percentage points higher than an RL-without-skills baseline while using 59% fewer tokens.

March 29, 2026 · 11 min read

Source Paper

Reinforcement Learning for Self-Improving Agent with Skill Library

Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, Lin Lee Cheong · University of Wisconsin–Madison; AWS Agentic AI (Amazon)


When Your AI Agent Solves the Same Problem 400 Times and Never Gets Faster

If you have deployed an AI agent against a large volume of repetitive operational tasks, you have probably noticed something quietly frustrating: it does not get better. Task 400 takes just as long as Task 1. The agent looks up the same API documentation, reconstructs the same logic, and executes the same sequence of steps, every single time. There is no accumulation. The agent has perfect recall of its training data and zero memory of its own work.

This is not a prompt engineering problem. It is a training problem. Current agents are optimized to complete individual tasks, not to become more efficient across similar tasks over time. The skill libraries that researchers have built to address this are real, but they depend on the base model reliably following prompting instructions to generate and reuse structured code functions. With most open-source models, that reliability is not there. The library fills up with functions that never get called, or that get called with the wrong context and produce errors that cost more steps to recover from than the skill saved.

A team from AWS Agentic AI and the University of Wisconsin–Madison decided the problem was structural, not cosmetic. Their paper introduces SAGE (Skill Augmented GRPO for self-Evolution), a reinforcement learning framework that trains an agent to build a working skill library as part of its task-solving process. Applied to Qwen2.5-32B-Instruct on the AppWorld benchmark, an environment simulating real digital task automation across nine everyday apps, SAGE achieves 72.0% Task Goal Completion and 60.7% Scenario Goal Completion on the standard test set. The baseline RL approach without a skill library hits 69.2% and 51.8% respectively, while requiring 16.4 average interaction steps and 3,613 tokens per task. SAGE gets the same work done in 12.1 steps and 1,475 tokens. That 59% token reduction compounds directly into cost at scale. What makes the approach worth understanding is that the efficiency gains and the accuracy gains come from the same mechanism rather than from a trade-off between them.

The Compounding Logic That Prior Skill Libraries Could Not Capture

The core idea is straightforward but has a non-obvious implication. When a skill library agent processes a task, it does not call individual APIs directly. Instead, it writes a named, parameterized Python function encapsulating the multi-step logic, saves that function to its library, and then calls it. Next time a similar task arrives, the agent retrieves that function and runs it rather than reconstructing the logic from scratch.

That sounds like a productivity tool. What SAGE turns it into is a training signal.

Here is the problem with training agents on individual tasks: if you reward only task completion, the agent has no incentive to write a good reusable function versus a brittle one-off that happens to work this time. The reward looks identical either way. So the agent learns to complete tasks, not to build an accumulating capability.

SAGE solves this by making the reward span the task chain, not the task. Training happens across pairs of structurally similar tasks. The agent earns an extra reward point only when a skill it generated in Task 1 gets successfully used to complete Task 2. That linkage forces the model to learn that writing a clean, general function is worth the token cost, because the payoff appears in a different task.

The practical translation looks like this:

| Academic Term | What It Actually Means |
| --- | --- |
| Sequential Rollout | The agent is trained on pairs of related tasks in sequence, with the skill library carrying over between them |
| Skill-integrated Reward | The agent gets a bonus point when a skill from Task 1 is both saved and successfully reused in Task 2 |
| Skill Generation Reward | Paid when Task 1's function is later used by Task 2 and leads to success |
| Skill Usage Reward | Paid when Task 2 successfully uses a function inherited from the library |
| SGC (Scenario Goal Completion) | The share of three-task scenarios where all three tasks succeed, a proxy for how well skills transfer across similar work |

This reward structure also carries a format enforcement: any agent response that terminates without generating code receives a -1.0 penalty. The agent learns that every step must produce executable logic, not a conversational answer.
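Putting the pieces together, the reward shaping described above can be sketched as a small function. This is a sketch of our reading of the paper's reward terms, not the authors' implementation; the function name and argument layout are invented for illustration.

```python
def skill_integrated_reward(task_done, produced_code,
                            skill_reused_downstream=False,
                            used_inherited_skill=False):
    """Sketch of the skill-integrated reward (names are ours, not the paper's)."""
    if not produced_code:
        return -1.0                          # format penalty: every step must emit code
    reward = 1.0 if task_done else 0.0       # base outcome reward
    if skill_reused_downstream:              # Task 1 bonus: its saved skill helped Task 2
        reward += 1.0
    if used_inherited_skill and task_done:   # Task 2 bonus: reused an inherited function
        reward += 1.0
    return reward
```

Note that both bonuses equal 1.0, matching the base completion reward in magnitude, as the paper specifies for Step 5 of training.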

Why Prompting Alone Could Not Get Here, and Why RL on Single Tasks Did Not Either

The skill library concept is not new. Voyager built one for Minecraft exploration. Agent Workflow Memory applied it to web browsing. The common architecture is: run the task, observe the trajectory, then use a separate LLM call to extract and save a reusable skill. Two problems compound over time.

First, this post-hoc extraction depends entirely on the base model's ability to follow a carefully crafted prompt and produce a well-formed, reusable function. With frontier models like GPT-4o, this works reasonably well. With open-source 32B-class models, it fails often enough to matter. The SAGE paper is direct about this: the baseline skill library agent using only Qwen2.5-32B-Instruct with prompting achieves just 30.7% Task Goal Completion, lower than the same model running without any skill library at all (34.7%). The library is not neutral; it actively hurts performance until the model is trained to use it properly.

Second, standard RL on individual tasks cannot fix this. Even if you add a skill library to the environment and let the model generate functions, the training signal only grades whether the current task succeeded. The quality of the skill for future tasks is invisible to the reward. SAGE's Sequential Rollout is the mechanism that makes future-task quality visible during training, and the Skill-integrated Reward is what puts a price on that visibility.

The authors tried variants to verify this. Using only an outcome-based reward (task completion, no skill-usage bonus) with the same sequential structure still trails SAGE by 5.3 percentage points in SGC. Using a chain-based reward (bonus only when both tasks succeed, without tracking whether a skill was actually responsible) trails by 4.1 points. The specific connection between skill generation in one task and skill use in the next is load-bearing.

How the Agent Decides What to Write, Save, and Reuse

The mechanics of a single training step are worth walking through, because they describe exactly how a deployed agent would behave.

  1. Task arrives. The agent receives a task instruction and a skill library containing any functions saved from prior tasks in the same scenario. On the first task in a chain, the library is empty.

  2. Library check. The agent scans retrieved functions for relevance. If a matching function exists, it calls it directly. If the function fails on execution, the agent is permitted to update it and retry.

  3. New function definition. If no suitable function exists, the agent writes one. The function must be abstract and parameterized, not hard-coded to the current task's specific values. A function named spotify_get_songs_by_genre(genre: str) qualifies. A function that hard-codes genre='pop' does not. The prompt enforces this distinction explicitly.

  4. Immediate execution. The agent calls the function it just wrote. If execution succeeds, the function is saved to the library. If it fails, the agent updates the definition and retries before saving.

  5. Reward calculation. At training time, the reward for Task 1 includes a bonus if and only if the function it saved was retrieved and used by Task 2, and Task 2 succeeded. The reward for Task 2 includes a bonus if it successfully used a function from the inherited library. Both bonuses equal 1.0, matching the base task completion reward in magnitude.

  6. Gradient update. The policy is updated across both tasks simultaneously, with the advantage for each output computed relative to other agents in the same training group facing the same task chain. Skills that helped across tasks reinforce writing behavior; skills that went unused or caused errors do not.
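The loop in steps 1 through 4 can be sketched as a minimal skeleton. Everything here is illustrative: the class, the method names, and the retrieve-by-name shortcut are our simplifications of the mechanism, not the paper's code.

```python
class SkillLibrary:
    """Minimal sketch of the carry-over library; the real system stores
    executable Python source and retrieves by scenario or query similarity."""
    def __init__(self):
        self.skills = {}                 # name -> callable

    def retrieve(self, name):
        return self.skills.get(name)

    def save(self, name, fn):
        self.skills[name] = fn           # saved only after a successful execution


def solve_task(library, name, define_fn, args):
    """One step of the loop: reuse an existing skill if present,
    otherwise define a new parameterized function, execute it, and save it."""
    fn = library.retrieve(name)          # step 2: library check
    if fn is None:
        fn = define_fn()                 # step 3: agent writes a new function
    result = fn(*args)                   # step 4: immediate execution
    library.save(name, fn)               # persist for the next task in the chain
    return result
```

On the second task in a chain, `define_fn` is never called, which is exactly where the step and token savings come from.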

At evaluation time, tasks are presented with retrieved skills from prior work within the same scenario. The paper tests three retrieval approaches for cases where scenario labels are unavailable: N-gram similarity on the query text reaches 60.1% SGC (within 0.6 points of the ideal same-scenario ceiling), embedding similarity on queries reaches 59.5%, and embedding retrieval on the skill functions themselves reaches 56.0%. N-gram similarity is the most robust, because tasks within the same scenario in AppWorld share near-identical query structure, and lexical overlap captures this better than semantic embedding.
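A minimal version of the N-gram retrieval baseline might look like the following. The word-trigram choice and Jaccard scoring are assumptions for illustration; the paper does not pin down the exact configuration.

```python
def ngrams(text, n=3):
    """Word n-grams of a query string (n=3 is an assumed setting)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_retrieve(query, library, n=3):
    """Return skills from the stored task whose query has the highest
    n-gram Jaccard overlap with the incoming query."""
    q = ngrams(query, n)
    def score(stored_query):
        s = ngrams(stored_query, n)
        return len(q & s) / len(q | s) if q | s else 0.0
    best = max(library, key=lambda item: score(item["query"]))
    return best["skills"]
```

Because same-scenario tasks share near-identical query templates, this kind of lexical overlap is enough to find the right prior task.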

What This Looks Like Inside a Finance Operations Team Running Hundreds of Similar Tasks Daily

Consider a corporate finance operations team that has deployed an agent to handle cash management tasks: transferring funds between accounts, reconciling vendor payments, checking account balances against approval thresholds, and notifying stakeholders. These tasks arrive in high volume, the underlying apps and APIs are fixed, and roughly 80% of the work involves the same five or six patterns repeated across different users, amounts, and dates.

Before SAGE-style training, the agent treats each task independently. It looks up the Venmo API documentation, constructs the authentication logic, writes the payment execution code, and handles pagination through transaction records, every time. At 16.4 average interaction steps and 3,613 tokens per task, processing 500 tasks per day at a mid-tier inference cost produces a predictable, non-trivial line item that grows linearly with volume.

With an agent trained via SAGE, the first task in a cluster generates the reusable functions: venmo_authenticate_supervisor(), venmo_transfer_with_memo(recipient_id, amount, memo), venmo_get_recent_transactions(page_limit). These are saved. The next 499 tasks in structurally similar clusters retrieve those functions and execute them directly. Average steps drop to 12.1, average tokens to 1,475. At 500 tasks per day, that is a 59% reduction in inference token consumption, compounding without any additional prompt engineering or model changes.
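The back-of-envelope math behind that claim, using the paper's per-task averages. The dollar rate per million tokens is a placeholder assumption, not a quoted price.

```python
# Per-task averages from the paper; the price is an assumed placeholder.
BASELINE_TOKENS = 3613
SAGE_TOKENS = 1475
TASKS_PER_DAY = 500
PRICE_PER_M_TOKENS = 2.00  # assumed $ per million tokens

tokens_saved_per_day = (BASELINE_TOKENS - SAGE_TOKENS) * TASKS_PER_DAY
dollars_saved_per_day = tokens_saved_per_day / 1_000_000 * PRICE_PER_M_TOKENS
savings_fraction = (BASELINE_TOKENS - SAGE_TOKENS) / BASELINE_TOKENS  # ~0.59
```

At these numbers the agent saves roughly a million tokens per day on this workload, and the fraction saved matches the 59% figure cited above.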

The SGC improvement matters here more than the TGC improvement. TGC measures individual task success. SGC measures whether all three tasks in a cluster succeed. In finance operations, partial scenario completion is often operationally useless: if you successfully transferred the payment but failed to send the confirmation message and update the ledger, the task is not done. SAGE raises SGC from 51.8% to 60.7% on the standard benchmark, an 8.9 point gain that represents entire workflows completing cleanly rather than partial completions that require human remediation.

Critically, SAGE-trained agents eventually exceed the performance of the expert model used to teach them. The SFT phase uses Claude 3.5 Sonnet V2 trajectories as a teacher. Claude, running without skill library training, achieves 41.1% SGC. SAGE, starting from Claude's demonstrations and then learning through RL, reaches 60.7%. The student surpasses the teacher because RL explores combinations and efficiencies the teacher never demonstrated.

The Infrastructure Decision That Either Compounds or Resets

The immediate question for any team building production agents is whether SAGE is a framework to evaluate now or a direction to watch. The honest answer is: it depends on where you are in the SFT bootstrapping problem.

SAGE requires an initial supervised fine-tuning phase on high-quality expert trajectories before RL begins. Without it, the base model cannot generate well-formed skill functions reliably enough for RL to learn from. The paper's ablation is clear: starting RL directly from the base model yields 25.6% SGC. Starting from self-distillation (using the base model to generate its own training data rather than an expert) yields 53.6%. Starting from expert SFT data yields 60.7%. That gap between self-distillation and expert SFT is 7.1 points of SGC. For organizations without access to a frontier model to generate bootstrap trajectories, the gap is real and closing it takes investment.

Phase one: Build the expert trajectory dataset. Before any RL, you need a frontier model running your actual task environment and generating successful trajectories with proper function definitions. The paper used Claude 3.5 Sonnet V2 across AppWorld's 30 training scenarios with rejection sampling, keeping only scenarios where at least the first two tasks succeeded. This yielded 1,129 usable examples. For your domain, this phase is an audit of your task inventory: identify the high-frequency, structurally similar task clusters where skill transfer has the most leverage. Failure looks like a sparse trajectory dataset that the SFT model cannot generalize from.
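The rejection-sampling filter from phase one can be sketched as follows, under an assumed data layout (a list of scenarios, each holding ordered task records). The field names are ours.

```python
def rejection_sample(scenarios):
    """Keep trajectories only from scenarios whose first two tasks both
    succeeded, mirroring the paper's filter for building the SFT dataset.
    The scenario/task dict layout is an assumption for illustration."""
    kept = []
    for scenario in scenarios:
        if all(task["success"] for task in scenario["tasks"][:2]):
            kept.extend(scenario["tasks"])
    return kept
```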

Phase two: SFT on the expert trajectories. Fine-tune your target open-source model on the collected trajectories with environment observations masked and agent responses as the training target. The goal is not to replicate expert behavior perfectly. It is to give the RL phase a starting point where the model can write syntactically correct, parameterized functions reliably. The SFT model alone will not beat a well-tuned RL baseline, but it gives RL somewhere to go. Failure looks like a model that still hard-codes values or writes non-callable function bodies.

Phase three: SAGE training with Sequential Rollout. Construct task chains from your scenario clusters. Two-task chains are sufficient and computationally tractable; the paper found that three-task chains increase compute without improving performance, likely because reward asymmetry between the first and later tasks creates gradient instability. Run SAGE with the Skill-integrated Reward. Monitor the Success Skill Usage Rate specifically, not just overall TGC. If the rate is not climbing, the reward signal is not connecting skill generation to downstream task outcomes. Failure looks like a model that generates skills but does not reuse them, which means your scenario clustering is too loose and the tasks are not similar enough to share functions.
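One plausible way to monitor the Success Skill Usage Rate during phase three, under our reading of the metric (the paper's exact formula may differ): the fraction of successful tasks that reused a library skill.

```python
def success_skill_usage_rate(trajectories):
    """Fraction of successful tasks that reused a library skill.
    This exact definition is our interpretation, not the paper's formula."""
    successes = [t for t in trajectories if t["success"]]
    if not successes:
        return 0.0
    return sum(t["used_library_skill"] for t in successes) / len(successes)
```

If this number stays flat while TGC climbs, the agent is solving tasks from scratch rather than through the library, which is the failure mode described above.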

The strategic question underneath all of this is about what kind of value curve you are building. A standard RL agent improves until it plateaus on the training distribution and stays there. An agent trained with SAGE is learning a behavior, not just a policy: write reusable functions, save them, use them. That behavior persists at inference time and continues generating returns on every new task cluster the agent encounters, whether it saw those clusters in training or not. The difference between an agent that resets after every deployment and one that accumulates operational knowledge across tasks is not a performance gap today. It is a cost structure and a capability compounding rate. Organizations that get the bootstrapping phase right in the next 18 months will be running agents that are structurally cheaper and more capable than those starting from scratch later.