52
Evaluation scenarios
Hand-built retrieval tasks stratified by family, complexity, and answerability.
11,366
Production test cases
Semi-structured test cases with free text, instructions, metadata, and trace links.
3 metric families
Evaluation signals
Functional checks, rank-aware IR metrics, and LLM-as-a-judge response-quality scores.
18
Controlled setups
Three models compared across two workflows (ReAct and Orchestrator-worker) and three prompt/skill setups (full prompt, prompt + skills, and guided skills).
Results over 52 industrial retrieval scenarios. The charts compare ReAct and Orchestrator-worker across full-prompt and skill-based setups.
All three prompt/skill setups are shown together.
Thin whiskers show an approximate 95% binomial interval for each pass-rate estimate over the 52 scenarios.
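As a rough illustration of how such a whisker could be computed, the sketch below uses the Wilson score interval. This is an assumption: the page only says "approximate 95% binomial interval" and does not name the method. The count of 47 passes is inferred from the reported 90.4% of 52 scenarios, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate binomial interval for a pass rate (Wilson score method).

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# 47 of 52 scenarios passing gives the 90.4% pass rate reported on this page.
low, high = wilson_interval(47, 52)
print(f"pass rate {47 / 52:.1%}, interval {low:.1%} to {high:.1%}")
```

With only 52 scenarios the interval is wide, so configurations a few percentage points apart overlap substantially.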
| Configuration | Functional pass (0-100%, higher is better) | IR NDCG@10 (0-1, higher is better) | LLM judge (1-5, higher is better) | Time (lower is better) | Tools (lower is better) |
|---|---|---|---|---|---|
| GLM-5.1, Orchestrator-worker, Full prompt | 90.4% | 0.597 | 4.43 | 167s | 36 |
| DeepSeek-V3.2, Orchestrator-worker, Prompt + skills | 88.5% | 0.733 | 4.32 | 437s | 48 |
| GLM-5.1, Orchestrator-worker, Guided skills | 88.5% | 0.660 | 4.51 | 195s | 30 |
| DeepSeek-V3.2, ReAct, Full prompt | 84.6% | 0.437 | 4.33 | 115s | 12 |
| GLM-5.1, Orchestrator-worker, Prompt + skills | 82.7% | 0.695 | 4.28 | 202s | 47 |
| DeepSeek-V3.2, Orchestrator-worker, Guided skills | 82.7% | 0.692 | 4.41 | 416s | 40 |
| DeepSeek-V3.2, ReAct, Guided skills | 80.8% | 0.464 | 4.13 | 151s | 14 |
| DeepSeek-V3.2, ReAct, Prompt + skills | 80.8% | 0.432 | 4.16 | 122s | 14 |
| DeepSeek-V3.2, Orchestrator-worker, Full prompt | 76.9% | 0.615 | 4.28 | 520s | 46 |
| GLM-5.1, ReAct, Full prompt | 76.9% | 0.613 | 4.11 | 43s | 6 |
These metrics are shown separately because they answer different questions: task correctness, retrieval ranking, and answer quality.
Functional pass
90.4%
GLM-5.1, Orchestrator-worker, Full prompt
0-100%, higher is better. Deterministic expected-artifact and entity checks.
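A minimal sketch of what a deterministic expected-artifact and entity check might look like. The exact pass criterion is not stated on this page; the sketch assumes a scenario passes when every expected artifact and every expected entity appears in the response. The function names and the identifiers such as `TC-101` are hypothetical.

```python
def functional_pass(
    returned_artifacts: list[str],
    returned_entities: list[str],
    expected_artifacts: list[str],
    expected_entities: list[str],
) -> bool:
    """Assumed criterion: all expected artifacts and entities are present."""
    artifacts_ok = set(expected_artifacts) <= set(returned_artifacts)
    entities_ok = set(expected_entities) <= set(returned_entities)
    return artifacts_ok and entities_ok


def pass_rate(results: list[bool]) -> float:
    """Share of passing scenarios on the 0-100% scale."""
    if not results:
        raise ValueError("no results to aggregate")
    return 100.0 * sum(results) / len(results)


# Hypothetical identifiers, for illustration only.
ok = functional_pass(
    returned_artifacts=["TC-101", "TC-207", "TC-330"],
    returned_entities=["REQ-12"],
    expected_artifacts=["TC-101", "TC-207"],
    expected_entities=["REQ-12"],
)
print(ok, pass_rate([ok, False]))
```

Because the check compares sets of identifiers, it gives the same verdict on every run, unlike the LLM-as-judge score.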
IR metric
0.733
DeepSeek-V3.2, Orchestrator-worker, Prompt + skills
NDCG@10 ranges from 0 to 1, higher is better. 1.000 means expected artifacts are ranked at the top.
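For readers unfamiliar with the metric, here is a small sketch of NDCG@10 under the assumption of binary relevance (an artifact is either expected or not). The page does not say whether graded relevance is used, and the function name `ndcg_at_k` and the identifiers are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids: list[str], expected_ids: list[str], k: int = 10) -> float:
    """NDCG@k with binary relevance: 1 for an expected artifact, 0 otherwise."""
    expected = set(expected_ids)
    if not expected:
        return 0.0
    counted: set[str] = set()
    dcg = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in expected and item not in counted:
            dcg += 1.0 / math.log2(rank + 1)  # discount grows with rank
            counted.add(item)
    # Ideal ranking: all expected artifacts first, capped at k positions.
    ideal_hits = min(len(expected), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


perfect = ndcg_at_k(["TC-1", "TC-2", "TC-9"], ["TC-1", "TC-2"])
demoted = ndcg_at_k(["TC-9", "TC-1"], ["TC-1"])
print(perfect, round(demoted, 3))
```

In the second call the single expected artifact sits at rank 2, so the score drops below 1 even though the artifact was retrieved. This is why the IR metric can disagree with the functional pass rate.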
LLM-as-judge
4.51
GLM-5.1, Orchestrator-worker, Guided skills
Mean 1-5 score, higher is better, across response-quality dimensions.
Each point is one model/workflow pair for the selected prompt/skill setup.
Functional pass uses a 0-100% scale where higher is better. The cost axis range shown is 0-572s, where lower is better. The shaded high-pass band starts at 75%; green is the lower-cost half and orange is the higher-cost half.
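The shading rule can be sketched as a small classifier. This assumes that the "halves" split the displayed cost axis at its midpoint (286s for a 0-572s range), which the page does not state explicitly. The function name `band_colour` is hypothetical.

```python
from typing import Optional


def band_colour(
    pass_pct: float,
    cost_s: float,
    axis_max_s: float = 572.0,
    pass_threshold: float = 75.0,
) -> Optional[str]:
    """Return the shaded band a point falls in, or None below the high-pass band."""
    if pass_pct < pass_threshold:
        return None
    # Assumption: lower-cost half means at or below the axis midpoint.
    return "green" if cost_s <= axis_max_s / 2 else "orange"


# Values borrowed from two rows of the results table above.
print(band_colour(90.4, 167))  # high pass rate, lower-cost half
print(band_colour(76.9, 520))  # high pass rate, higher-cost half
```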
The 52 scenarios were deliberately stratified across retrieval families instead of sampled as one undifferentiated pool.
Pass is shown on a 0-100% scale. IR is NDCG@10 on a 0-1 scale. Higher is better for both.
Traceability
16 scenarios
Follow explicit links between procedures, requirements, backlog items, and test cases.
Search
14 scenarios
Find relevant tests from natural-language descriptions where wording does not match cleanly.
Lookup
10 scenarios
Resolve known entities or identifiers to the relevant test artifacts and evidence.
Impact
8 scenarios
Estimate which tests are affected by a changed component or related artifact.
Comparison
4 scenarios
Compare related artifact sets and explain overlaps, gaps, or differences.
Each scenario includes expected artifacts, expected entity references, a reference answer, and an answerability label.
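One way to picture a scenario record and the stratification is the sketch below. The field names, the `Scenario` class, and the boolean answerability label are assumptions made for illustration; only the listed components and the per-family counts come from this page.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    family: str                          # e.g. "Traceability", "Search"
    question: str
    expected_artifacts: tuple[str, ...]
    expected_entities: tuple[str, ...]
    reference_answer: str
    answerable: bool                     # assumed boolean answerability label


# Per-family counts as listed above; they sum to the 52 scenarios.
FAMILY_COUNTS = {
    "Traceability": 16,
    "Search": 14,
    "Lookup": 10,
    "Impact": 8,
    "Comparison": 4,
}


def stratification_gaps(scenarios: list[Scenario]) -> dict[str, int]:
    """Return actual minus expected count for each family that is off target."""
    actual = Counter(s.family for s in scenarios)
    return {
        family: actual.get(family, 0) - target
        for family, target in FAMILY_COUNTS.items()
        if actual.get(family, 0) != target
    }


total_scenarios = sum(FAMILY_COUNTS.values())
print(total_scenarios)
```

A check like `stratification_gaps` makes it easy to confirm that a scenario file still matches the intended family mix after edits.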
Complexity strata