Haibo Ding

Senior Applied Scientist & Science Manager · Agentic AI · LLMs · Evaluation


I am a Senior Applied Scientist and Science Manager at AWS AI Labs, where I conduct research on agentic AI systems and large language models (LLMs) and build products around them, with a focus on tool-using agents and LLM/agent evaluation.

My research asks: How can we build agents that use tools effectively while remaining measurable, reliable, and robust? This question motivates work on tool selection and tool-use optimization, and on offline/online evaluation methods that measure LLM outputs, tool calls, and agent trajectories.

Current research topics include LLM/agent evaluation — training evaluator/judge models for trajectory- and tool-level assessment; designing offline (build-time) evaluation, including metric selection, evaluation datasets, and automated evaluation pipelines; and developing online evaluation for efficient performance monitoring and continuous quality measurement. I also work on tool selection and optimization, including improving tool selection accuracy, tool retrieval modeling, and tool description optimization.

Background: Previously, I was a Senior Research Scientist at Bosch Research, where I built ML/NLP solutions for dialogue sentiment analysis, customer service understanding, document understanding, and knowledge extraction. I received my Ph.D. in Computer Science from the University of Utah, where I worked on semi-supervised learning and natural language processing.

News

Dec 03, 2025 Open-sourced Agent-EvalKit — an AI assistant toolkit for build-time agent evaluation.
Dec 02, 2025 Launched Amazon Bedrock AgentCore Evaluations (preview) for agent performance monitoring.
Aug 04, 2025 Organized the KDD Workshop on Automatic Prompt Optimization.

Selected Publications

  1. ArXiv
    Diffusion Language Model Inference with Monte Carlo Tree Search
    Zheng Huang, Kiran Ramnath, Yueyan Chen, and 8 more authors
    ArXiv, 2025
  2. ArXiv
    Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation
    Zhichao Xu, Zongyu Wu, Yun Zhou, and 9 more authors
    ArXiv, 2025
  3. EMNLP
    SLOT: Structuring the Output of Large Language Models
    Darren Yow-Bang Wang, Zhengyuan Shen, Soumya Smruti Mishra, and 3 more authors
    EMNLP, 2025
  4. TACL
    How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering
    Zhengbao Jiang, Jun Araki, Haibo Ding, and 1 more author
    Transactions of the Association for Computational Linguistics, 2021