
DynaGuard: A Dynamic Guardrail Model With User-Defined Policies

University of Maryland · Capital One

Abstract

Guardian models are used to supervise and moderate the outputs of user-facing chatbots, enforcing guardrails and detecting bad behaviors. Standard guardian models like LlamaGuard detect predefined, static categories of harms. We propose dynamic guardian models that evaluate text based on user-defined policies, making them useful for different application domains that are not addressed by standard guardian models. Our dynamic guardian models can be used for fast detection of policy violations or with chain-of-thought reasoning that articulates and justifies the model outputs. Our dynamic guardian models match static models in detection accuracy for static harm categories while identifying violations of free-form policies with accuracy comparable to frontier reasoning models in a fraction of the time.

Try it yourself!

Experience DynaGuard's real-time policy enforcement with custom rules.

Why Dynamic Guardians?

  • Dynamic: Accept arbitrary, user‑defined rules (policies) instead of fixed harm categories; see the sketch after this list.
  • Accurate: Preserve strong performance on standard safety ontologies.
  • Interpretable: Output actionable, natural‑language explanations for any rule violations.
  • Fast option: Provide a low‑latency classification path without unnecessary reasoning overhead.
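
To make the contract concrete, here is a minimal sketch of the inputs a dynamic guardian consumes and the verdict it produces. The policy text, dialogue, and the `expected_verdict` structure are hypothetical illustrations, not DynaGuard's released schema.

```python
# Hypothetical illustration of the dynamic-guardian contract: the policy is
# free-form text supplied at inference time rather than a fixed harm category.
policy = (
    "R1: Never quote a discount larger than 10%.\n"
    "R2: Do not reveal internal ticket IDs to the user."
)

dialogue = [
    {"role": "user", "content": "Can you do any better on the price?"},
    {"role": "assistant", "content": "Sure, 25% off! (internal ticket #88123)"},
]

# The guardian evaluates the dialogue against the policy; here the final
# assistant turn breaks both rules, so the expected output is a failing
# verdict plus a natural-language explanation (illustrative field names).
expected_verdict = {
    "compliant": False,
    "violated_rules": ["R1", "R2"],
    "explanation": (
        "The assistant offered a 25% discount (the limit is 10%) and "
        "exposed an internal ticket ID."
    ),
}
```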

The Compliance Dataset

To train and evaluate dynamic guardians, we release Compliance, a dataset of 60K multi‑turn dialogues paired with diverse guardrail policies. Policies span categories beyond traditional toxicity (e.g., User Experience, Regulatory Compliance, Content Controls, Transactions, Agentic Tasks), enabling robust generalization to domain‑specific rules. A sketch of loading the dataset follows the category list below.

60K multi-turn dialogues · 8 policy categories · 1000s of high-clarity rules
Policy Categories
  • User Experience: Rules ensuring positive interactions
  • Regulatory Compliance: Legal and regulatory requirements
  • Content Controls: Content filtering and restrictions
  • Transactions: Financial and commercial guidelines
  • Agentic Tasks: AI agent behavior controls
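
If the dataset is published on the Hugging Face Hub, loading and inspecting it might look like the sketch below. The dataset ID and field names are assumptions for illustration; consult the release for the actual names.

```python
# Minimal sketch of loading Compliance with the `datasets` library.
from datasets import load_dataset

ds = load_dataset("tomg-group-umd/compliance", split="train")  # hypothetical ID

record = ds[0]
print(record["policy"])    # free-form guardrail rules   (assumed field name)
print(record["dialogue"])  # multi-turn conversation     (assumed field name)
print(record["label"])     # PASS/FAIL compliance label  (assumed field name)
```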

DynaGuard Model

DynaGuard is a guardian model fine‑tuned to enforce arbitrary policies and supply concise justifications. It supports a dual‑mode interface: (1) a lightweight classification path and (2) a reasoning‑augmented path that emits explanations the assistant can use to self‑correct.
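
A minimal sketch of the two paths with the Hugging Face `transformers` library is shown below. The checkpoint name, prompt layout, and mode-selecting instructions are assumptions for illustration, not DynaGuard's released conventions; the real model may expose an explicit mode switch instead.

```python
# Sketch of a dual-mode guardian query: a short decode budget for the fast
# classification path, a long one for the reasoning-augmented path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tomg-group-umd/DynaGuard-4B"  # hypothetical checkpoint name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

policy = "R1: Never promise a refund unless the user has provided a receipt."
dialogue = "User: Can I get a refund?\nAssistant: Of course, no receipt needed!"

def guard(instruction: str, max_new_tokens: int) -> str:
    """Run one guardian query; `instruction` selects the mode (assumed)."""
    messages = [{
        "role": "user",
        "content": f"Policy:\n{policy}\n\nDialogue:\n{dialogue}\n\n{instruction}",
    }]
    ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

# (1) Low-latency classification path: verdict only, tiny decode budget.
print(guard("Answer PASS or FAIL only.", max_new_tokens=4))

# (2) Reasoning path: explanation first, then a verdict the supervised
#     chatbot can use to self-correct.
print(guard("Explain which rules are violated, then answer PASS or FAIL.",
            max_new_tokens=512))
```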

Key Features
  • Outperforms prior dedicated guardian models on user‑defined harms
  • Maintains strong results on standard safety ontologies
  • Offers low‑latency inference and controllable sensitivity
  • Provides natural language explanations for violations

Selected Results

Performance Highlights
  • Frontier‑level accuracy, smaller footprint: An open‑source ~4B‑parameter DynaGuard model achieves high accuracy on Compliance, surpassing general‑purpose API baselines such as GPT‑4o while being orders of magnitude cheaper and faster.
  • Hard benchmarks: Models trained on Compliance generalize to out‑of‑distribution (OOD) policies and multi‑hop cases where earlier guardian approaches struggle.

Reference

BibTeX
@article{hoover2025dynaguard,
    title={DynaGuard: A Dynamic Guardrail Model With User-Defined Policies},
    author={Monte Hoover and Vatsal Baherwani and Neel Jain and Khalid Saifullah and Joseph Vincent and Chirag Jain and Melissa Kazemi Rad and C. Bayan Bruss and Ashwinee Panda and Tom Goldstein},
    journal={arXiv preprint arXiv:2509.02563},
    year={2025},
    url={https://arxiv.org/abs/2509.02563},
}