
DynaGuard: A Dynamic Guardrail Model With User-Defined Policies

University of Maryland · Capital One

Abstract

Guardian models are used to supervise and moderate the outputs of user-facing chatbots, enforcing guardrails and detecting bad behaviors. Standard guardian models like LlamaGuard detect predefined, static categories of harms. We propose dynamic guardian models that evaluate text based on user-defined policies, making them useful for different application domains that are not addressed by standard guardian models. Our dynamic guardian models can be used for fast detection of policy violations or with chain-of-thought reasoning that articulates and justifies the model outputs. Our dynamic guardian models match static models in detection accuracy for static harm categories while identifying violations of free-form policies with accuracy comparable to frontier reasoning models in a fraction of the time.

Try it yourself!

Experience DynaGuard's real-time policy enforcement with custom rules.

Why Dynamic Guardians?

  • Dynamic: Accept arbitrary, user‑defined rules (policies) instead of fixed harm categories; see the sketch after this list.
  • Accurate: Preserve strong performance on standard safety ontologies.
  • Interpretable: Output actionable, natural‑language explanations for any rule violations.
  • Fast option: Provide a low‑latency classification path without unnecessary reasoning overhead.
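
To make the contract concrete, here is a minimal sketch of the inputs a dynamic guardian consumes and the verdict it produces. The policy text, dialogue, and the `expected_verdict` structure are hypothetical illustrations, not DynaGuard's released schema.

```python
# Hypothetical illustration of the dynamic-guardian contract: the policy is
# free-form text supplied at inference time rather than a fixed harm category.
policy = (
    "R1: Never quote a discount larger than 10%.\n"
    "R2: Do not reveal internal ticket IDs to the user."
)

dialogue = [
    {"role": "user", "content": "Can you do any better on the price?"},
    {"role": "assistant", "content": "Sure, 25% off! (internal ticket #88123)"},
]

# The guardian evaluates the dialogue against the policy; here the final
# assistant turn breaks both rules, so the expected output is a failing
# verdict plus a natural-language explanation (illustrative field names).
expected_verdict = {
    "compliant": False,
    "violated_rules": ["R1", "R2"],
    "explanation": (
        "The assistant offered a 25% discount (the limit is 10%) and "
        "exposed an internal ticket ID."
    ),
}
```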

The Compliance Dataset

To train and evaluate dynamic guardians, we release Compliance, a dataset of 60K multi‑turn dialogues paired with diverse guardrail policies. Policies span categories beyond traditional toxicity (e.g., User Experience, Regulatory Compliance, Content Controls, Transactions, Agentic Tasks), enabling robust generalization to domain‑specific rules. A sketch of loading the dataset follows the category list below.

60K multi-turn dialogues · 8 policy categories · 1000s of high-clarity rules
Policy Categories
  • User Experience: Rules ensuring positive interactions
  • Regulatory Compliance: Legal and regulatory requirements
  • Content Controls: Content filtering and restrictions
  • Transactions: Financial and commercial guidelines
  • Agentic Tasks: AI agent behavior controls
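
If the dataset is published on the Hugging Face Hub, loading and inspecting it might look like the sketch below. The dataset ID and field names are assumptions for illustration; consult the release for the actual names.

```python
# Minimal sketch of loading Compliance with the `datasets` library.
from datasets import load_dataset

ds = load_dataset("tomg-group-umd/compliance", split="train")  # hypothetical ID

record = ds[0]
print(record["policy"])    # free-form guardrail rules   (assumed field name)
print(record["dialogue"])  # multi-turn conversation     (assumed field name)
print(record["label"])     # PASS/FAIL compliance label  (assumed field name)
```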

DynaGuard Model

DynaGuard is a guardian model fine‑tuned to enforce arbitrary policies and supply concise justifications. It supports a dual‑mode interface: (1) a lightweight classification path and (2) a reasoning‑augmented path that emits explanations the assistant can use to self‑correct.
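
A minimal sketch of the two paths with the Hugging Face `transformers` library is shown below. The checkpoint name, prompt layout, and mode-selecting instructions are assumptions for illustration, not DynaGuard's released conventions; the real model may expose an explicit mode switch instead.

```python
# Sketch of a dual-mode guardian query: a short decode budget for the fast
# classification path, a long one for the reasoning-augmented path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tomg-group-umd/DynaGuard-4B"  # hypothetical checkpoint name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

policy = "R1: Never promise a refund unless the user has provided a receipt."
dialogue = "User: Can I get a refund?\nAssistant: Of course, no receipt needed!"

def guard(instruction: str, max_new_tokens: int) -> str:
    """Run one guardian query; `instruction` selects the mode (assumed)."""
    messages = [{
        "role": "user",
        "content": f"Policy:\n{policy}\n\nDialogue:\n{dialogue}\n\n{instruction}",
    }]
    ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

# (1) Low-latency classification path: verdict only, tiny decode budget.
print(guard("Answer PASS or FAIL only.", max_new_tokens=4))

# (2) Reasoning path: explanation first, then a verdict the supervised
#     chatbot can use to self-correct.
print(guard("Explain which rules are violated, then answer PASS or FAIL.",
            max_new_tokens=512))
```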

Key Features
  • Outperforms prior dedicated guardian models on user‑defined harms
  • Maintains strong results on standard safety ontologies
  • Offers low‑latency inference and controllable sensitivity
  • Provides natural language explanations for violations

Selected Results

Performance Highlights
  • Frontier‑level accuracy, smaller footprint: An open‑source ~4B‑parameter DynaGuard model achieves high accuracy on Compliance, surpassing general‑purpose API baselines such as GPT‑4o while being orders of magnitude cheaper and faster.
  • Hard benchmarks: Models trained on Compliance generalize to out‑of‑distribution (OOD) policies and multi‑hop cases where earlier guardian approaches struggle.

Reference

BibTeX
@article{hoover2025dynaguard,
    title={DynaGuard: A Dynamic Guardrail Model With User-Defined Policies},
    author={Monte Hoover and Vatsal Baherwani and Neel Jain and Khalid Saifullah and Joseph Vincent and Chirag Jain and Melissa Kazemi Rad and C. Bayan Bruss and Ashwinee Panda and Tom Goldstein},
    journal={arXiv preprint arXiv:2509.02563},
    year={2025},
    url={https://arxiv.org/abs/2509.02563},
}