---
license: mit
language: en
pipeline_tag: text-generation
tags:
- conversational
- text-generation
- medical
- diagnosis
- agent
- reinforcement-learning
base_model: Qwen3-8B
datasets:
- HealthBench
- MAQuE
- MedQA
- MMLU
paper: 2510.04284
model_name: Doctor-R1
metrics:
- accuracy
---

# Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning

**Doctor-R1** is an AI doctor agent trained to conduct strategic, multi-turn patient inquiries to guide its diagnostic decision-making. Unlike traditional models that excel at static medical QA, Doctor-R1 is designed to master the complete, dynamic consultation process, unifying the two core skills of a human physician: communication and decision-making.

This model is an 8B parameter agent built upon **Qwen3-8B** and fine-tuned using a novel **Experiential Agentic Reinforcement Learning** framework.
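A minimal usage sketch with `transformers` is shown below. This is an assumption-laden illustration, not an official snippet: the repository id (`YOUR_ORG/Doctor-R1`), the system prompt, and the generation settings are placeholders, and the model is assumed to use Qwen3's standard chat template.

```python
# Hedged sketch: multi-turn consultation with Doctor-R1 via transformers.
# The model id, system prompt, and sampling settings below are assumptions.
from typing import Dict, List, Tuple


def build_consultation(turns: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Convert (speaker, text) pairs into chat messages for the chat template.

    Patient turns map to the "user" role; doctor turns map to "assistant".
    """
    messages = [{
        "role": "system",
        "content": "You are Doctor-R1, an AI doctor. Ask clarifying questions "
                   "before committing to a diagnosis.",  # placeholder prompt
    }]
    for speaker, text in turns:
        role = "user" if speaker == "patient" else "assistant"
        messages.append({"role": role, "content": text})
    return messages


def generate_reply(messages: List[Dict[str, str]],
                   model_id: str = "YOUR_ORG/Doctor-R1") -> str:
    """Generate the doctor's next turn (requires a GPU; not run at import time)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

After each patient reply, append it to the message list and call `generate_reply` again to continue the consultation.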

## ✨ Key Features

* **Unified Clinical Skills:** The first agent framework to holistically integrate two core clinical skills, **strategic patient inquiry** and **accurate medical decision-making**, within a single model.
* **Experiential Reinforcement Learning:** A novel closed-loop framework where the agent learns and improves from an accumulating repository of its own high-quality experiences.
* **Dual-Competency Reward System:** A sophisticated two-tiered reward architecture that separately optimizes for both conversational quality (soft skills) and diagnostic accuracy (hard skills), featuring a "safety-first" veto system.
* **State-of-the-Art Performance:** Outperforms leading open-source models on challenging dynamic benchmarks like HealthBench and MAQuE with high parameter efficiency (8B).
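To make the dual-competency reward concrete, here is a toy illustration of a two-tiered reward with a safety-first veto. The weights, score ranges, and veto rule are assumptions for exposition only, not the paper's actual reward function.

```python
# Toy illustration (NOT the paper's implementation) of a two-tiered reward:
# soft skills (communication) and hard skills (diagnosis) are scored
# separately, and any safety violation vetoes the entire reward.
def dual_competency_reward(comm_score: float,
                           diag_score: float,
                           safety_violation: bool,
                           w_comm: float = 0.4,
                           w_diag: float = 0.6) -> float:
    """Combine conversational and diagnostic scores (both in [0, 1]).

    The weights w_comm/w_diag are illustrative placeholders.
    """
    if safety_violation:
        # Safety-first veto: an unsafe trajectory earns no reward,
        # regardless of how well it scored on either competency.
        return 0.0
    return w_comm * comm_score + w_diag * diag_score
```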

## 🏆 Leaderboards

Doctor-R1 achieves state-of-the-art performance among open-source models and surpasses several powerful proprietary models on HealthBench, combining superior results on dynamic benchmarks with strong foundational knowledge on static QA tasks.

| Benchmark          | Key Metric | Doctor-R1 | Best Open-Source (>=32B) |
| :----------------- | :--------- | :-------: | :----------------------: |
| **HealthBench**    | Avg. Score | **36.29** |          33.16           |
| **MAQuE**          | Accuracy   | **60.00** |          57.00           |
| **MedQA**          | Accuracy   | **83.50** |          81.50           |
| **MMLU (Medical)** | Accuracy   | **85.00** |          84.00           |

The detailed breakdown of **HealthBench Main (Dynamic Consultation)** is shown below:

| Model                     | Avg. Score | Accuracy  | Comm. Quality | Context Aware. |
| :------------------------ | :--------: | :-------: | :-----------: | :------------: |
| **GPT-o3** (Proprietary)  |   38.91    |   40.31   |     64.78     |     48.09      |
| **Doctor-R1 (8B)**        | **36.29**  | **37.84** |   **64.15**   |   **49.24**    |
| Baichuan-M2-32B           |   33.16    |   33.95   |     58.01     |     46.80      |
| Grok-4 (Proprietary)      |   33.03    |   37.95   |     61.35     |     45.62      |
| GPT-4.1 (Proprietary)     |   31.18    |   34.78   |     60.65     |     44.81      |
| UltraMedical-8B           |   22.19    |   25.50   |     57.40     |     40.26      |
| **Base Model (Qwen3-8B)** |   25.13    |   28.57   |     49.35     |     43.00      |



## 👥 Human Evaluation

To validate that our quantitative results align with user experience, we conducted a pairwise human preference evaluation against other leading models. The results show a decisive preference for Doctor-R1, especially in patient-centric metrics.

![](assets/human.png)



## 🔬 Ablation Studies

Our ablation studies validate the critical contributions of our framework's key components.

***Impact of Experience Retrieval Mechanism.*** The results show that our full retrieval mechanism with reward and novelty filtering provides a significant performance boost over both a no-experience baseline and a standard similarity-based retrieval, especially in communication skills.

<p align="center">
  <img src="assets/radar_exp.jpg" style="width:60%;" />
</p>
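The retrieval mechanism described above can be sketched as follows. This is a toy version for intuition only: the thresholds, the cosine-similarity scoring, and the greedy novelty check are assumptions, not the paper's actual algorithm.

```python
# Toy sketch of reward- and novelty-filtered experience retrieval
# (thresholds and similarity function are assumptions, not the paper's).
import math
from typing import Dict, List


def cosine_sim(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_experiences(query_emb: List[float],
                         repository: List[Dict],
                         reward_min: float = 0.7,
                         novelty_max_sim: float = 0.95,
                         k: int = 3) -> List[Dict]:
    """Return up to k high-reward experiences, skipping near-duplicates.

    Each repository entry is assumed to look like {"emb": [...], "reward": float}.
    """
    # Reward filter: keep only high-quality past experiences.
    candidates = [e for e in repository if e["reward"] >= reward_min]
    # Rank by relevance to the current query.
    candidates.sort(key=lambda e: cosine_sim(query_emb, e["emb"]), reverse=True)
    selected: List[Dict] = []
    for exp in candidates:
        # Novelty filter: drop experiences too similar to ones already chosen.
        if all(cosine_sim(exp["emb"], s["emb"]) < novelty_max_sim for s in selected):
            selected.append(exp)
        if len(selected) == k:
            break
    return selected
```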

***Impact of Patient Agent Scaling.*** We observe a strong, positive correlation between the number of simulated patient interactions during training and the agent's final performance. This validates that our agentic framework effectively learns and improves from a large volume of diverse experiences.

![](assets/patient_scaling.png)




## 📜 Citation

If you find our work useful in your research, please consider citing our paper:

```bibtex
@misc{lai2025doctorr1masteringclinicalinquiry,
      title={Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning}, 
      author={Yunghwei Lai and Kaiming Liu and Ziyue Wang and Weizhi Ma and Yang Liu},
      year={2025},
      eprint={2510.04284},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.04284}, 
}

```



## 💬 Contact & Questions

For collaborations or inquiries, please contact [**[email protected]**](mailto:[email protected]). You’re also welcome to open an issue or join the discussion in this repository; we value your insights and contributions to **Doctor-R1**.

Stay tuned and join our community as we push the boundaries of intelligent healthcare. Together, let’s make medical AI safer, smarter, and more human. 🤝