Agent Safety: Building Guardrails
Agents with tools can cause real damage. Here's how to build safety into every layer of your agent system.
Why Safety Matters
An agent with email access can:
- Spam your contacts
- Send incorrect information
- Leak sensitive data
Guardrails limit how much damage any one of these failures can cause.
Layers of Protection
1. Input Validation
- Sanitize user inputs
- Reject malicious prompts
- Limit request complexity
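The three checks above can be sketched as a single validation function. This is only a minimal sketch: the names `validateInput`, `MAX_INPUT_LENGTH`, and `SUSPICIOUS_PATTERNS` are assumptions made for illustration, and pattern matching alone will not catch every malicious prompt.

```javascript
// Hypothetical limits and patterns; tune these for your own application.
const MAX_INPUT_LENGTH = 2000;
const SUSPICIOUS_PATTERNS = [
  /ignore (all )?(previous|prior) instructions/i,
  /reveal (your )?system prompt/i,
];

function validateInput(rawInput) {
  if (typeof rawInput !== "string") {
    return { ok: false, reason: "Input must be a string" };
  }
  // Sanitize: strip control characters (keeping tab and newline), then trim.
  const cleaned = rawInput
    .replace(/[\u0000-\u0008\u000B-\u001F\u007F]/g, "")
    .trim();
  if (cleaned.length === 0) {
    return { ok: false, reason: "Input is empty" };
  }
  // Limit request complexity with a simple length cap.
  if (cleaned.length > MAX_INPUT_LENGTH) {
    return { ok: false, reason: "Input is too long" };
  }
  // Reject inputs that match known prompt-injection phrasing.
  if (SUSPICIOUS_PATTERNS.some((pattern) => pattern.test(cleaned))) {
    return { ok: false, reason: "Input matches a blocked pattern" };
  }
  return { ok: true, value: cleaned };
}
```

Returning a result object instead of throwing lets the caller decide whether to show the reason to the user or only log it.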
2. Tool Restrictions
- Whitelist allowed operations
- Require confirmation for destructive actions
- Limit rate of actions
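One way to combine these restrictions is a single gate that every tool call passes through. The sketch below assumes hypothetical tool names and a limit of five calls per minute; `checkToolCall` and the constants are illustrative, not part of any particular agent framework.

```javascript
// Hypothetical tool names and limits, for illustration only.
const ALLOWED_TOOLS = new Set(["search_docs", "read_calendar", "send_email"]);
const NEEDS_CONFIRMATION = new Set(["send_email"]);
const MAX_CALLS_PER_MINUTE = 5;
const ONE_MINUTE_MS = 60 * 1000;

const recentCalls = []; // timestamps (ms) of recently approved calls

function checkToolCall(toolName, now = Date.now()) {
  // Allowlist: anything not explicitly permitted is refused.
  if (!ALLOWED_TOOLS.has(toolName)) {
    return { allowed: false, reason: `Tool "${toolName}" is not on the allowlist` };
  }
  // Drop timestamps older than one minute, then enforce the rate limit.
  while (recentCalls.length > 0 && now - recentCalls[0] >= ONE_MINUTE_MS) {
    recentCalls.shift();
  }
  if (recentCalls.length >= MAX_CALLS_PER_MINUTE) {
    return { allowed: false, reason: "Rate limit exceeded" };
  }
  recentCalls.push(now);
  // Tell the caller whether a human must confirm before the tool runs.
  return { allowed: true, needsConfirmation: NEEDS_CONFIRMATION.has(toolName) };
}
```

The `now` parameter exists so the rate limit can be exercised with fixed timestamps; in normal use it defaults to the current time.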
3. Output Filtering
- Scan for sensitive data leakage
- Validate format before sending
- Log all outputs for audit
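As a rough illustration of the scanning and logging steps, the sketch below redacts matches for two sample patterns and records every output. The patterns (including the `sk-` key prefix), the in-memory `auditLog`, and the `filterOutput` name are all assumptions; a real deployment would likely need a broader pattern set and persistent storage.

```javascript
// Hypothetical patterns; real deployments usually need a broader, tested set.
const SENSITIVE_PATTERNS = [
  { label: "email", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "api_key", regex: /\bsk-[A-Za-z0-9]{16,}\b/g },
];

const auditLog = []; // in-memory stand-in for a persistent audit log

function filterOutput(text) {
  let redacted = text;
  const findings = [];
  for (const { label, regex } of SENSITIVE_PATTERNS) {
    // Replace each match and remember which kind of data was found.
    redacted = redacted.replace(regex, () => {
      findings.push(label);
      return `[REDACTED ${label}]`;
    });
  }
  // Log the redacted text so the audit log does not itself leak data.
  auditLog.push({ timestamp: new Date().toISOString(), findings, output: redacted });
  return { text: redacted, findings };
}
```

Logging the redacted version rather than the raw text is a deliberate choice here: otherwise the audit log becomes a second place where sensitive data can leak.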
4. Human Oversight
- Review mode for high-risk actions
- Easy override/stop mechanisms
- Alerts for unusual behavior
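Review mode and a stop mechanism can be sketched with a shared state object that every action checks first. The `oversight` object, `stopAgent`, `submitAction`, and the `"low"`/`"high"` risk labels are hypothetical names chosen for this example; how risk is assessed is left to the caller.

```javascript
// Hypothetical oversight state: a stop flag and a queue of actions awaiting review.
const oversight = { stopped: false, reviewQueue: [] };

// Stop mechanism: once set, no further actions are approved.
function stopAgent() {
  oversight.stopped = true;
}

// riskLevel is assumed to be supplied by the caller ("low" or "high").
function submitAction(action, riskLevel) {
  if (oversight.stopped) {
    return { status: "blocked", reason: "Agent was stopped by an operator" };
  }
  if (riskLevel === "high") {
    // Review mode: hold the action until a human approves it.
    oversight.reviewQueue.push(action);
    return { status: "pending_review" };
  }
  return { status: "approved" };
}
```

Checking the stop flag before anything else means an operator's stop takes effect even for low-risk actions.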
Implementing Confirmations
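// Ask for confirmation before sending an email flagged as risky.
// isDestructive and askUser are helpers your application must provide.
async function sendEmail(to, subject, body) {
  if (isDestructive(to, subject, body)) {
    const confirmed = await askUser(`Send email to ${to}?`);
    if (!confirmed) return "Cancelled";
  }
  // proceed with sending
}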
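The `isDestructive` check used in `sendEmail` is left undefined. One possible policy, shown purely as an assumption, is to require confirmation when any recipient is outside a trusted domain or when the message goes to many recipients at once. The sketch assumes `to` is a comma-separated string; `TRUSTED_DOMAIN` and the recipient limit are placeholders.

```javascript
// Hypothetical policy: confirm when a recipient is outside a trusted
// domain or when the message looks like a bulk send.
const TRUSTED_DOMAIN = "example.com";
const MAX_RECIPIENTS_WITHOUT_CONFIRMATION = 3;

// subject and body are unused in this policy but kept so the signature
// matches the call in sendEmail.
function isDestructive(to, subject, body) {
  const recipients = to
    .split(",")
    .map((address) => address.trim())
    .filter(Boolean);
  const hasExternal = recipients.some(
    (address) => !address.toLowerCase().endsWith(`@${TRUSTED_DOMAIN}`)
  );
  const isBulk = recipients.length > MAX_RECIPIENTS_WITHOUT_CONFIRMATION;
  return hasExternal || isBulk;
}
```

A stricter policy could also inspect `subject` and `body`, for example to flag attachments or links.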
Monitoring & Alerts
- Rate limits: alert if the agent sends 100 or more messages per hour
- Unusual patterns: flag unexpected tool combinations
- Error rates: investigate unusually high failure rates
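The rate-limit alert can be sketched as a sliding one-hour window over message timestamps. The threshold of 100 comes from the list above; `recordMessage` and the in-memory `sentTimestamps` array are assumptions for illustration, and a production system would more likely rely on its metrics or logging stack.

```javascript
const ALERT_THRESHOLD = 100;      // messages per hour, from the rule above
const WINDOW_MS = 60 * 60 * 1000; // one hour

const sentTimestamps = []; // timestamps (ms) of messages sent in the last hour

// Record a sent message and report whether the hourly threshold is reached.
function recordMessage(now = Date.now()) {
  sentTimestamps.push(now);
  // Discard timestamps that have fallen out of the one-hour window.
  while (sentTimestamps.length > 0 && now - sentTimestamps[0] >= WINDOW_MS) {
    sentTimestamps.shift();
  }
  const countLastHour = sentTimestamps.length;
  return { countLastHour, alert: countLastHour >= ALERT_THRESHOLD };
}
```

The caller decides what an alert means in practice, such as paging an operator or pausing the agent.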
The Feedback Loop
When agents make mistakes:
- Log the error with its context
- Store it in a feedback file
- Have the agent read the feedback before future actions
- This keeps the agent from repeating the same mistakes
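The loop above can be sketched with a plain file that is appended to on each mistake and read back before the next action. The file location, the one-JSON-object-per-line format, and the function names `logMistake` and `readFeedback` are assumptions made for this example, which uses Node.js built-in modules.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical location and format: one JSON object per line.
const FEEDBACK_FILE = path.join(os.tmpdir(), "agent-feedback.jsonl");

// Append a mistake, with its context, to the feedback file.
function logMistake(action, error, context) {
  const entry = { timestamp: new Date().toISOString(), action, error, context };
  fs.appendFileSync(FEEDBACK_FILE, JSON.stringify(entry) + "\n");
}

// Return past mistakes for an action so they can be added to the agent's prompt.
function readFeedback(action) {
  if (!fs.existsSync(FEEDBACK_FILE)) return [];
  return fs
    .readFileSync(FEEDBACK_FILE, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.action === action);
}
```

Before the agent calls a tool, the entries returned by `readFeedback` for that tool can be included in its instructions so that earlier mistakes inform the next attempt.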