When Agentic Workflows Help: Hybrid Retrieval for Test Case Recommendation over Heterogeneous Software Artifacts

Benchmark Highlights

52

Evaluation scenarios

Hand-built retrieval tasks stratified by family, complexity, and answerability.

11,366

Production test cases

Semi-structured test cases with free text, instructions, metadata, and trace links.

3 metric families

Evaluation signals

Functional checks, rank-aware IR metrics, and LLM-as-a-judge response-quality scores.

Controlled setups

Three models compared across ReAct, Orchestrator-worker, full-prompt, and skill-based setups.

Interactive Benchmark Results

Results over 52 industrial retrieval scenarios. The charts compare ReAct and Orchestrator-worker across full-prompt and skill-based setups.

Prompt / skill setup

All three prompt/skill setups are shown together.

Leaderboard across all prompt/skill setups

Thin whiskers show an approximate 95% binomial interval for each pass-rate estimate over the 52 scenarios.

Model	Workflow	Prompt/skill setup	Functional pass (0-100%, higher is better)	IR NDCG@10 (0-1, higher is better)	LLM judge (1-5, higher is better)	Time (lower is better)	Tools (lower is better)
GLM-5.1	Orchestrator-worker	Full prompt	90.4%	0.597	4.43	167s	36
DeepSeek-V3.2	Orchestrator-worker	Prompt + skills	88.5%	0.733	4.32	437s	48
GLM-5.1	Orchestrator-worker	Guided skills	88.5%	0.660	4.51	195s	30
DeepSeek-V3.2	ReAct	Full prompt	84.6%	0.437	4.33	115s	12
GLM-5.1	Orchestrator-worker	Prompt + skills	82.7%	0.695	4.28	202s	47
DeepSeek-V3.2	Orchestrator-worker	Guided skills	82.7%	0.692	4.41	416s	40
DeepSeek-V3.2	ReAct	Guided skills	80.8%	0.464	4.13	151s	14
DeepSeek-V3.2	ReAct	Prompt + skills	80.8%	0.432	4.16	122s	14
DeepSeek-V3.2	Orchestrator-worker	Full prompt	76.9%	0.615	4.28	520s	46
GLM-5.1	ReAct	Full prompt	76.9%	0.613	4.11	43s	6
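
The page does not say which approximation the whisker intervals use. As a minimal sketch, assuming a Wilson score interval and assuming each pass rate is a plain fraction of the 52 scenarios (so 90.4% corresponds to 47 of 52), the width of such an interval could be estimated as follows. The function name `wilson_interval` is illustrative and not taken from the benchmark code.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate two-sided binomial interval (Wilson score).

    Assumption: the source only says "approximate 95% binomial interval";
    Wilson is one common choice, not necessarily the one used on the page.
    """
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 47 of 52 scenarios passed is the count implied by a 90.4% pass rate.
low, high = wilson_interval(47, 52)
print(f"pass rate {47 / 52:.1%}, interval {low:.1%} to {high:.1%}")
```

With only 52 scenarios the interval spans well over ten percentage points, so small gaps between adjacent leaderboard rows should be read cautiously.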

Metric-family leaders

These metrics are shown separately because they answer different questions: task correctness, retrieval ranking, and answer quality.

Functional pass

90.4%

GLM-5.1, Orchestrator-worker, Full prompt

0-100%, higher is better. Deterministic expected-artifact and entity checks.

IR metric

0.733

DeepSeek-V3.2, Orchestrator-worker, Prompt + skills

NDCG@10 ranges from 0 to 1, higher is better. 1.000 means expected artifacts are ranked at the top.

LLM-as-judge

4.51

GLM-5.1, Orchestrator-worker, Guided skills

Mean 1-5 score, higher is better, across response-quality dimensions.
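
For readers unfamiliar with the IR metric above, here is a minimal sketch of NDCG@10 under the assumption of binary relevance (an artifact is either expected or not). The page does not state whether graded relevance is used, and the function name and the artifact IDs in the example are hypothetical.

```python
import math
from typing import Iterable, Sequence


def ndcg_at_k(ranked_ids: Sequence[str], expected_ids: Iterable[str], k: int = 10) -> float:
    """NDCG@k with binary gains: 1 for an expected artifact, 0 otherwise."""
    expected = set(expected_ids)
    if not expected or k <= 0:
        return 0.0
    seen: set[str] = set()
    dcg = 0.0
    for rank, artifact_id in enumerate(ranked_ids[:k]):
        # Count each expected artifact at most once, at its first position.
        if artifact_id in expected and artifact_id not in seen:
            seen.add(artifact_id)
            dcg += 1.0 / math.log2(rank + 2)
    # Ideal ranking places all expected artifacts at the top of the list.
    ideal_hits = min(len(expected), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


# Hypothetical IDs: both expected artifacts ranked first gives 1.0.
print(ndcg_at_k(["TC-1", "TC-2", "TC-9"], {"TC-1", "TC-2"}))
# One expected artifact pushed to rank 2 gives a lower score.
print(ndcg_at_k(["TC-9", "TC-1"], {"TC-1"}))
```

The logarithmic discount is why a configuration can pass the functional check (the expected artifacts are present) and still score low on NDCG@10 (they are ranked far from the top).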

Cost-performance trade-off

Each point is one model/workflow pair for the selected prompt/skill setup.

Functional pass uses a 0-100% scale where higher is better. The cost axis range shown is 0-572s, where lower is better. The shaded high-pass band starts at 75%; green is the lower-cost half and orange is the higher-cost half.

Legend: high pass, lower cost (green); high pass, higher cost (orange).
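
The banding described for the trade-off chart can be sketched as a small classifier. The 75% threshold and the 0-572s axis range come from the text above; splitting the cost axis at its midpoint (286s) is an assumption, since the page only says "lower-cost half" and "higher-cost half". The `Point` type and `band` function are illustrative names.

```python
from dataclasses import dataclass

HIGH_PASS_THRESHOLD = 75.0  # percent, from the chart description
COST_AXIS_MAX_S = 572.0     # seconds, from the chart description


@dataclass(frozen=True)
class Point:
    label: str
    pass_pct: float
    time_s: float


def band(point: Point, cost_split_s: float = COST_AXIS_MAX_S / 2) -> str:
    """Assign a chart band; the midpoint split is an assumption."""
    if point.pass_pct < HIGH_PASS_THRESHOLD:
        return "below high-pass band"
    if point.time_s <= cost_split_s:
        return "high pass, lower cost"
    return "high pass, higher cost"


# Two rows taken from the leaderboard.
points = [
    Point("GLM-5.1 / Orchestrator-worker / Full prompt", 90.4, 167),
    Point("DeepSeek-V3.2 / Orchestrator-worker / Prompt + skills", 88.5, 437),
]
for p in points:
    print(p.label, "->", band(p))
```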

Scenario families

The 52 scenarios were deliberately stratified across retrieval families instead of sampled as one undifferentiated pool.

Pass is shown on a 0-100% scale. IR is NDCG@10 on a 0-1 scale. Higher is better for both.

Traceability

16 scenarios

94.1% pass, 0.391 IR

Follow explicit links between procedures, requirements, backlog items, and test cases.

14 scenarios

54.0% pass, 0.325 IR

Find relevant tests from natural-language descriptions where wording does not match cleanly.

Lookup

10 scenarios

95.0% pass, 0.762 IR

Resolve known entities or identifiers to the relevant test artifacts and evidence.

Impact

8 scenarios

66.7% pass, 0.246 IR

Estimate which tests are affected by a changed component or related artifact.

Comparison

4 scenarios

66.7% pass, 0.838 IR

Compare related artifact sets and explain overlaps, gaps, or differences.

Each scenario includes expected artifacts, expected entity references, a reference answer, and an answerability label.
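
To make the scenario structure concrete, the sketch below models one scenario record and a deterministic check in the spirit of the functional pass metric. Everything here is an assumption for illustration: the field names, the example IDs, and the rule that every expected artifact must be returned and every expected entity must be mentioned in the response. The page does not specify the actual schema or matching rules, nor how unanswerable scenarios are scored, so the sketch covers answerable scenarios only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Scenario:
    family: str                       # e.g. "Traceability", "Lookup"
    complexity: str                   # "Single-hop", "Multi-hop", or "Reasoning"
    expected_artifacts: frozenset[str]
    expected_entities: frozenset[str]
    reference_answer: str
    answerable: bool


def functional_pass(scenario: Scenario, returned_artifacts: Iterable[str], response_text: str) -> bool:
    """Hypothetical deterministic check for an answerable scenario."""
    if not scenario.answerable:
        raise ValueError("unanswerable scenarios are not modeled in this sketch")
    artifacts_ok = scenario.expected_artifacts <= set(returned_artifacts)
    text = response_text.lower()
    entities_ok = all(entity.lower() in text for entity in scenario.expected_entities)
    return artifacts_ok and entities_ok


# Hypothetical scenario with made-up identifiers.
scenario = Scenario(
    family="Traceability",
    complexity="Multi-hop",
    expected_artifacts=frozenset({"TC-1001"}),
    expected_entities=frozenset({"REQ-42"}),
    reference_answer="TC-1001 verifies REQ-42.",
    answerable=True,
)
print(functional_pass(scenario, ["TC-1001", "TC-2000"], "REQ-42 is covered by TC-1001."))
```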

Complexity strata

Single-hop, Multi-hop, Reasoning

Authors

Berkay Orhan
