Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios

1Huazhong University of Science and Technology,
2Shanghai Jiao Tong University, 3Duke University, 4Lehigh University

*Indicates Equal Contribution

Overview of AutoSafe, which consists of the following steps: 1) generate a risk-scenario dataset Dr from the predefined available tools F and risk outcomes O, where each scenario includes a user instruction u and a historical interaction trajectory τ; 2) sample safe actions a_t^s from the environment via a self-reflection mechanism and construct the safety dataset Dsafe; 3) fine-tune the LLM on this dataset.

Abstract

Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as digital assistants, autonomous customer service, and decision-support systems, where their ability to interact in multi-turn, tool-augmented environments makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions, enabling precise modeling of safety risks across diverse scenarios; and 2) we develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self-reflective reasoning to generate safe responses, and constructs a large-scale, diverse, and high-quality safety training dataset, eliminating the need for hazardous real-world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM-based agents for real-world deployment.
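The pipeline described above can be sketched at a high level as three stages: scenario synthesis, self-reflective safe-action sampling, and fine-tuning on the resulting dataset. The sketch below is illustrative only; every function and variable name is a hypothetical stand-in, not an API from the AutoSafe codebase, and the LLM calls are replaced by placeholders.

```python
# Minimal sketch of an AutoSafe-style data pipeline.
# All names (generate_risk_scenarios, self_reflect_safe_action, etc.)
# are hypothetical placeholders, not the paper's actual code.

import random


def generate_risk_scenarios(tools, outcomes, n=3):
    """Step 1: synthesize risk scenarios from predefined tools F and
    risk outcomes O. Each scenario pairs a user instruction u with a
    (here trivial) interaction trajectory tau. A real implementation
    would prompt an LLM instead of sampling randomly."""
    rng = random.Random(0)  # fixed seed for reproducibility of the sketch
    return [
        {
            "instruction": f"Use {rng.choice(tools)} to complete the task",
            "trajectory": [],
            "risk_outcome": rng.choice(outcomes),
        }
        for _ in range(n)
    ]


def self_reflect_safe_action(scenario):
    """Step 2: derive a safe action for the scenario via self-reflection.
    Here the 'reflection' is a stub that declines the risky step; the
    paper's mechanism would critique and revise an agent's action."""
    return {
        "scenario": scenario,
        "safe_action": f"refuse or add safeguards against {scenario['risk_outcome']}",
    }


def build_safety_dataset(tools, outcomes):
    """Combine steps 1 and 2 into the training dataset D_safe.
    Step 3 (fine-tuning the LLM on this dataset) would use a standard
    supervised fine-tuning loop and is omitted here."""
    scenarios = generate_risk_scenarios(tools, outcomes)
    return [self_reflect_safe_action(s) for s in scenarios]


dataset = build_safety_dataset(
    tools=["email_client", "file_manager"],
    outcomes=["data_leak", "file_deletion"],
)
```

The key design point the sketch mirrors is that no hazardous real-world data is collected: both the risky scenarios and the safe responses are generated automatically, so the only human input is the predefined tool set F and outcome taxonomy O.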

Main Results

Examples

Poster

BibTeX

@misc{zhou2025automatingsafetyenhancementllmbased,
      title={Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios},
      author={Xueyang Zhou and Weidong Wang and Lin Lu and Jiawen Shi and Guiyao Tie and Yongtian Xu and Lixing Chen and Pan Zhou and Neil Zhenqiang Gong and Lichao Sun},
      year={2025},
      eprint={2505.17735},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.17735},
}