MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
Abstract
MobilityBench is a scalable benchmark for evaluating LLM-based route-planning agents in real-world scenarios, featuring anonymized user queries and a deterministic sandbox for reproducible testing.
Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench .
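To make the idea of a deterministic API-replay sandbox concrete, here is a minimal sketch of record-and-replay for tool calls. It is not the MobilityBench implementation: the names `ReplaySandbox`, `call_key`, and `live_route_eta`, and the choice to key responses by a hash of the tool name and its arguments, are assumptions made for illustration only.

```python
import copy
import hashlib
import json
from itertools import count
from typing import Any, Callable, Dict


def call_key(tool: str, params: Dict[str, Any]) -> str:
    """Build a stable key from the tool name and its canonicalized arguments."""
    canonical = json.dumps({"tool": tool, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReplaySandbox:
    """Hypothetical record/replay wrapper around tool calls (illustration only)."""

    def __init__(self) -> None:
        self.store: Dict[str, Any] = {}

    def record(self, tool: str, params: Dict[str, Any],
               live_fn: Callable[..., Any]) -> Any:
        # Call the live service once and keep a private copy of its response.
        response = live_fn(**params)
        self.store[call_key(tool, params)] = copy.deepcopy(response)
        return response

    def replay(self, tool: str, params: Dict[str, Any]) -> Any:
        # Serve the stored response; never touch the live service.
        key = call_key(tool, params)
        if key not in self.store:
            raise KeyError(f"no recorded response for {tool} with {params}")
        return copy.deepcopy(self.store[key])


# Hypothetical stand-in for a live routing service whose answer drifts per call.
_ticks = count(start=30)


def live_route_eta(origin: str, destination: str) -> Dict[str, Any]:
    return {"origin": origin, "destination": destination, "eta_min": next(_ticks)}


sandbox = ReplaySandbox()
query = {"origin": "A", "destination": "B"}
recorded = sandbox.record("route_eta", query, live_route_eta)

# Replays are identical regardless of argument order or how often they run.
first = sandbox.replay("route_eta", query)
second = sandbox.replay("route_eta", {"destination": "B", "origin": "A"})
```

In this sketch, every agent under evaluation sees the same recorded response for the same call, so differences in outcomes come from the agent rather than from a changing service. Raising an error on an unrecorded call is one possible design choice; a real sandbox might handle missing entries differently.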
Community
A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
MobilityBench is a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. It is built from large-scale, anonymized user queries collected from Amap, covering a wide range of route-planning intents across multiple cities worldwide.
arXivLens breakdown of this paper: https://arxivlens.com/PaperView/Details/mobilitybench-a-benchmark-for-evaluating-route-planning-agents-in-real-world-mobility-scenarios-2338-a211e2a7
- Executive Summary
- Detailed Breakdown
- Practical Applications
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation (2026)
- TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios (2026)
- Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios (2026)
- AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents (2026)
- WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints (2026)
- AMAP Agentic Planning Technical Report (2025)
- MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use (2025)