Conversation

@JiahangXu
Contributor

No description provided.

@JiahangXu JiahangXu marked this pull request as ready for review December 16, 2025 13:54
Copilot AI review requested due to automatic review settings December 16, 2025 13:54
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the Search-R1 documentation by replacing the placeholder "Evaluation" section with benchmark results. The new section reports performance metrics that compare the original Search-R1 implementation against the Agent-Lightning version across multiple models and benchmarks.

Key Changes

  • Renamed section from "Evaluation" to "Benchmark Results"
  • Added description of seven diverse question-answering benchmarks (NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle)
  • Introduced performance comparison table showing results for Llama-3.2-3B, Qwen2.5-3B-Instruct, and Qwen2.5-7B-Instruct models

@ultmaster ultmaster merged commit 52090e9 into main Dec 16, 2025
35 checks passed
@JiahangXu JiahangXu deleted the dev/search_r1_benchmark branch December 17, 2025 06:28

3 participants