RedAct: Redacting Agent Capability Traces for Procedural Skill Protection

Shuwen Xu1,2 Zhitao He1 Yi R. (May) Fung1
1Hong Kong University of Science and Technology 2University of Chinese Academy of Sciences

sxucn@connect.ust.hk yrfung@ust.hk

Abstract

Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. RedAct addresses the tension between auditability and capability protection. It localizes protected key information, rewrites traces while preserving verifier-critical evidence, and injects behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, RedAct reduces normalized skill transfer from 44.7-67.1% on raw traces to below the no-skill baseline, while preserving the execution evidence needed to inspect what the agent did. Conceptually, RedAct reframes trace release as a controlled publication problem: the released artifact should still explain what happened during execution, but it should not expose a reusable recipe for reconstructing the owner's private skill.

Trace Disclosure Risk

Raw traces may reveal formulas, calibrated thresholds, tool choices, validation routines, and recovery strategies. RedAct treats trace release as a security interface: informative enough to audit, abstract enough to avoid direct procedural reuse.

This risk is especially relevant for long-horizon tool-using agents, where the value often lies not only in the final answer, but in the sequence of checks, tool calls, fallbacks, and domain-specific choices that led to it.

Problem motivation showing raw trace leakage and RedAct protection
Raw traces expose reusable private skills and enable skill distillation; RedAct protects released trajectories through selective rewriting and behavioral watermarking.

CapTraceBench

CapTraceBench contains specialized long-horizon tasks with task-local skill files, executable environments, and automatic verifiers for final success and step-level progress.

Each task is built around procedural knowledge that is useful for solving the environment but should not be trivially recoverable from a public trace. This makes the benchmark a stress test for protected trace release rather than a generic agent benchmark.

75Tasks
154Curated Skills
23Task Families
7Domains
CapTraceBench taxonomy and difficulty split
CapTraceBench spans seven domains and three difficulty levels.

Method

RedAct is deployed by the skill owner before trace publication. It sees the protected skill package, identifies private procedural items, releases an auditable protected trace, and optionally attaches behavioral provenance hooks.

The framework separates two goals that are often conflated: utility for human inspection and utility for downstream skill reconstruction. RedAct preserves the former while suppressing the latter.

RedAct pipeline for protected trace release
RedAct localizes protected procedural information, rewrites public agent traces, and injects behavioral watermarks for downstream provenance analysis.
1

Key-Item Localization

Identify formulas, constants, thresholds, specialized tools, validation routines, and private heuristics from the skill package.

2

Trace Rewriting

Abstract reusable procedural details while preserving execution order, tool-use evidence, final outputs, and verifier-critical fields.

3

Behavioral Watermarking

Insert neutral hooks such as Env Check and Ritual Marker that can persist in downstream students for provenance analysis.

Watermark hooks are functionally neutral action-observation patterns inserted into eligible protected traces. Standalone hooks provide broad provenance signals, while contextual hooks depend on tool observations or error states.

Ritual MarkerStandalone action pattern at task start or end.

Env CheckBenign environment-probing action.

Cross CheckContextual verification after tool observations.

Error AnchoringContextual recovery phrase after error feedback.

Experimental Results

The main table compares no skill access, oracle skill access, raw-trace reuse, and RedAct-protected trace reuse. We report success rate (SR) and step success rate (SSR) across six evaluated harness/model backends plus their average.

Downstream Reuse Settings

Single-Agent Skill Extraction

Synthesizes a reusable SKILL.md document and supporting scripts from released trajectories.

Multi-Agent Skill Evolution

Refines induced skills over multiple analyzer/evolver passes using successful and failed trajectories.

Retrieval Reuse

Indexes released traces and injects top-k similar snippets as in-context demonstrations at inference time.

Trajectory Fine-tuning

Fine-tunes a student on released trajectories to evaluate whether behavioral provenance signals persist.

Setting Diff. Claude Opus 4.6 Claude Sonnet 4.6 Claude Haiku 4.5 GPT-5.2 Codex Gemini 3 Pro Gemini 3 Flash Average
SRSSRSRSSRSRSSRSRSSRSRSSRSRSSRSRSSR
No SkillsEasy68.884.070.580.665.379.171.386.772.383.669.283.469.682.9
Medium42.371.936.266.127.560.941.371.541.867.937.867.037.867.5
Hard44.058.429.148.920.539.338.860.737.650.829.449.533.251.3
Avg.49.872.143.766.136.060.948.773.149.068.244.367.445.268.0
w/ Oracle SkillsEasy81.491.178.285.973.283.585.591.979.488.077.086.679.187.8
Medium61.581.548.072.447.972.162.782.154.976.049.673.554.176.3
Hard48.969.746.765.824.443.545.272.944.566.943.262.042.163.5
Avg.64.081.455.874.549.368.764.882.659.177.155.574.458.176.5
Raw Traces
w/ Extracted SkillsEasy80.087.076.782.570.581.478.391.575.386.575.885.076.185.6
Medium55.080.546.369.543.070.050.777.448.976.547.671.248.674.2
Hard47.167.544.958.321.042.143.268.041.954.738.756.439.557.8
Avg.59.979.354.170.445.366.756.479.054.474.253.171.553.973.5
w/ Evolved SkillsEasy74.586.074.582.769.881.274.589.479.586.075.485.574.785.1
Medium45.375.539.569.141.471.749.375.250.875.943.270.544.973.0
Hard49.161.440.262.531.748.842.569.541.364.642.163.841.161.8
Avg.53.975.149.071.246.869.054.577.756.376.051.573.052.073.7
w/ Retrieval ReuseEasy73.886.073.682.768.981.276.089.177.285.873.585.573.885.0
Medium44.974.838.868.636.264.647.274.448.073.242.069.842.970.9
Hard47.861.035.557.825.943.241.866.840.560.738.559.238.358.1
Avg.53.374.747.369.942.664.253.776.654.173.749.671.650.171.8
Traces Protected by RedAct
w/ Extracted SkillsEasy72.580.971.081.961.576.070.082.074.578.866.480.569.380.0
Medium43.267.238.661.328.462.843.872.442.568.937.463.039.065.9
Hard41.157.538.257.820.542.737.264.535.355.835.858.634.756.1
Avg.50.568.747.166.035.461.849.373.249.468.644.866.746.167.5
w/ Evolved SkillsEasy67.682.168.478.964.177.669.884.870.481.467.381.267.981.0
Medium41.170.435.064.426.859.640.069.640.566.036.965.236.765.9
Hard42.857.128.447.620.038.237.659.236.449.428.648.132.349.9
Avg.48.670.542.464.535.259.547.471.347.566.343.165.644.066.3
w/ Retrieval ReuseEasy66.881.767.978.063.576.568.983.869.780.866.780.767.280.2
Medium40.669.634.463.526.258.839.268.839.765.236.264.436.165.0
Hard42.456.627.847.019.737.837.158.435.948.828.147.631.849.4
Avg.48.069.941.863.634.758.846.670.446.865.642.564.943.465.5

Lower protected-trace scores indicate less transferable procedural utility from the released trace. RedAct pushes all three reuse channels to no higher than the no-skill baseline on average.

Behavioral watermarks provide provenance evidence

Standalone hooks produce the clearest detection signal after trajectory fine-tuning, while contextual hooks remain selective and introduce no false alarms in this evaluation. TD reports true detection rate, and FA reports false alarm rate.

TypeWatermarkQwen3-8BQwen3-4B
TDFATDFA
StandaloneEnv Check93.61.396.41.9
StandaloneRitual Marker100.00.099.80.0
ContextualCross Check18.50.016.40.0
ContextualError Anchoring28.30.032.20.0

Analysis

RedAct suppresses transfer

Raw traces expose procedural utility: extraction, evolution, and retrieval reuse reach 71.8-73.7% average step success. After RedAct rewriting, the same reuse methods fall to 65.5-67.5%. Normalized Skill Transfer drops across extraction, evolution, and retrieval reuse, often moving below the no-skill baseline.

Normalized skill transfer before and after RedAct protection across reuse methods and model backends

Protection removes residual leakage

After RedAct rewriting, NST becomes non-positive across extraction, evolution, and retrieval reuse. RPI also falls by 37-48%, indicating less recovered protected key information in downstream artifacts while preserving audit-critical execution evidence.

Protection summary showing NST and RPI reductions

Audit evidence stays intact

Protected traces retain 91.0-96.6% of final answers, tool names, verifier paths, and schema fields from raw traces, while removing 70.6% of protected key items. This supports trace release as an auditable artifact rather than an answer-only summary.

Release integrity after trace protection across audit preservation, operational stability, and key-item removal

Key-item guidance matters

Generic rewriting still leaves reusable signal in released traces. Explicit key-item localization pushes NST below the no-skill baseline across all three reuse methods and reduces recovered protected information in downstream artifacts.

Reuse MethodRewrite TypeSRSSRNSTRPI
Extracted SkillsKey-Item (Ours)50.568.7-36.620.4
Extracted SkillsGeneric55.676.951.627.6
Evolved SkillsKey-Item (Ours)48.670.5-17.221.1
Evolved SkillsGeneric52.874.222.629.4
Retrieval ReuseKey-Item (Ours)48.069.9-23.714.1
Retrieval ReuseGeneric57.972.32.215.7

Case Study

These demos compare raw and protected trajectories from the released benchmark traces. Raw trajectories highlight reusable key items in light red; protected trajectories highlight rewritten assistant text in light orange and injected watermark text in light purple.

BibTeX

@misc{xu2026redactredactingagentcapability,
      title={RedAct: Redacting Agent Capability Traces for Procedural Skill Protection}, 
      author={Shuwen Xu and Zhitao He and Yi R. Fung},
      year={2026},
      eprint={2606.10813},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2606.10813}, 
}