Haibo Ding

Senior Applied Scientist & Science Manager · Agentic AI · LLMs · Evaluation


I am a Senior Applied Scientist and Science Manager at AWS AI Labs, where I conduct research on agentic AI systems and large language models (LLMs) and build products around them, with a focus on tool-using agents and LLM/agent evaluation.

My research asks: How can we build agents that use tools effectively while remaining measurable, reliable, and robust? This question motivates work on tool selection and tool-use optimization, and on offline/online evaluation methods that measure LLM outputs, tool calls, and agent trajectories.

Current research topics include LLM/agent evaluation — training evaluator/judge models for trajectory- and tool-level assessment; designing offline (build-time) evaluation, including metric selection, evaluation datasets, and automated evaluation pipelines; and developing online evaluation for efficient performance monitoring and continuous quality measurement. I also work on tool selection and optimization, including improving tool selection accuracy, tool retrieval modeling, and tool description optimization.

Background: Previously, I was a Senior Research Scientist at Bosch Research, where I built ML/NLP solutions for dialogue sentiment analysis, customer service understanding, document understanding, and knowledge extraction. I received my Ph.D. in Computer Science from the University of Utah, where I worked on semi-supervised learning and natural language processing.

News

Dec 03, 2025 Open-sourced Agent-EvalKit — an AI assistant toolkit for build-time agent evaluation.
Dec 02, 2025 Launched Amazon Bedrock AgentCore Evaluations (preview) for agent performance monitoring.
Aug 04, 2025 Organized the KDD Workshop on Automatic Prompt Optimization.

Selected Publications

  1. ArXiv
    Diffusion Language Model Inference with Monte Carlo Tree Search
    Zheng Huang, Kiran Ramnath, Yueyan Chen, and 8 more authors
    ArXiv, 2025
  2. ArXiv
    Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation
    Zhichao Xu, Zongyu Wu, Yun Zhou, and 9 more authors
    ArXiv, 2025
  3. EMNLP
    SLOT: Structuring the Output of Large Language Models
    Darren Yow-Bang Wang, Zhengyuan Shen, Soumya Smruti Mishra, and 3 more authors
    EMNLP, 2025
  4. TACL
    How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering
    Zhengbao Jiang, Jun Araki, Haibo Ding, and 1 more author
    Transactions of the Association for Computational Linguistics, 2021