Agent Evaluation and Safety — Trust but Verify
Agents that take actions in the world require rigorous evaluation and safety measures. A single wrong action can have real consequences. This guide covers how to measure, test, and constrain agent behavior.
- Success Metrics — Measuring task completion and quality
- Failure Modes — Understanding how agents go wrong
- Safety Guardrails — Constraining agent actions to prevent harm
With great power comes great responsibility.
LLM agents that interact with external systems pose unique safety challenges. Unlike pure text generation, agent actions can have irreversible consequences — deleting files, sending emails, making purchases, or accessing sensitive data.
Definition: Agent Evaluation
Agent evaluation measures an agent's ability to successfully complete tasks, its efficiency in doing so, and the safety of its actions. This includes task success rate, action accuracy, efficiency metrics, and safety compliance.
Evaluation Metrics
Task Success Metrics
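Task Success Rate
$$\text{SR} = \frac{N_{\text{success}}}{N_{\text{total}}}$$
Here,
- $\text{SR}$ = Success rate
- $N_{\text{success}}$ = Number of tasks completed successfully
- $N_{\text{total}}$ = Total number of tasks attempted
Action Accuracy
$$\text{AA} = \frac{A_{\text{correct}}}{A_{\text{total}}}$$
Here,
- $\text{AA}$ = Action accuracy
- $A_{\text{correct}}$ = Number of actions judged correct
- $A_{\text{total}}$ = Total number of actions taken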
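Both metrics are simple ratios, so a small worked example may help. The sketch below assumes a hypothetical `TaskResult` record (not part of any framework) that stores whether a task succeeded and how many of its actions were judged correct. Action accuracy is pooled over all actions here; averaging per task is an equally valid choice.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one evaluated task (hypothetical record layout)."""
    succeeded: bool
    correct_actions: int  # actions judged correct by a reference or grader
    total_actions: int    # all actions the agent took on this task


def task_success_rate(results):
    """Fraction of tasks completed successfully; 0.0 for an empty suite."""
    if not results:
        return 0.0
    return sum(r.succeeded for r in results) / len(results)


def action_accuracy(results):
    """Fraction of all actions, pooled across tasks, that were correct."""
    total = sum(r.total_actions for r in results)
    if total == 0:
        return 0.0
    return sum(r.correct_actions for r in results) / total


results = [
    TaskResult(succeeded=True, correct_actions=4, total_actions=4),
    TaskResult(succeeded=False, correct_actions=2, total_actions=6),
    TaskResult(succeeded=True, correct_actions=5, total_actions=6),
    TaskResult(succeeded=True, correct_actions=3, total_actions=4),
]
print(task_success_rate(results))  # 3 of 4 tasks succeeded -> 0.75
print(action_accuracy(results))    # 14 of 20 actions correct -> 0.7
```

Note that the two numbers can diverge: an agent may take mostly correct actions and still miss the goal, which is why both are reported.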

    class AgentEvaluator:
        # Helper methods referenced below but not shown (check_safety,
        # get_tools_used, summarize_results, count_redundant_steps)
        # are left for the reader to implement.
        def __init__(self, test_suite):
            self.test_suite = test_suite

        def evaluate(self, agent):
            results = []
            for task in self.test_suite:
                # Run agent on task
                trajectory = agent.run(task)
                # Evaluate success
                success = self.check_success(task, trajectory)
                # Evaluate efficiency
                efficiency = self.measure_efficiency(trajectory)
                # Evaluate safety
                safety = self.check_safety(trajectory)
                results.append({
                    "task": task,
                    "success": success,
                    "efficiency": efficiency,
                    "safety": safety,
                    "steps": len(trajectory),
                    "tools_used": self.get_tools_used(trajectory),
                })
            return self.summarize_results(results)

        def check_success(self, task, trajectory):
            """Check if the task was completed successfully."""
            if not trajectory:
                # An empty trajectory cannot have reached the goal
                return False
            final_state = trajectory[-1]
            return task.goal_reached(final_state)

        def measure_efficiency(self, trajectory):
            """Measure how efficiently the task was completed."""
            total_steps = len(trajectory)
            redundant_steps = self.count_redundant_steps(trajectory)
            # max(..., 1) avoids division by zero for empty trajectories
            return 1.0 - (redundant_steps / max(total_steps, 1))

Trajectory Metrics
Definition: Trajectory Evaluation
Trajectory evaluation examines the sequence of actions taken by the agent, not just the final outcome. This reveals inefficient patterns, unnecessary detours, and potential safety issues.
| Metric | Description | Ideal |
|---|---|---|
| Step count | Total actions taken | Minimum necessary |
| Tool calls | Number of external tool invocations | Minimal |
| Redundant actions | Actions that don't progress toward goal | 0 |
| Recovery rate | Successful recoveries from errors | High |
| Human interventions | Times human help was needed | 0 |
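The metrics in the table can be computed directly from a recorded trajectory. The following is a minimal sketch under stated assumptions: the `Step` record is hypothetical, a "redundant" action is taken to mean an exact repeat of an earlier tool call, and an error counts as "recovered" when the very next step does not fail. Real systems will likely need looser definitions of both.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One step of a trajectory (hypothetical layout for this sketch)."""
    tool: Optional[str] = None  # None marks a reasoning step with no tool call
    args: tuple = ()            # arguments, kept hashable for comparison
    failed: bool = False        # the tool call returned an error
    human_help: bool = False    # a human had to step in at this point


def trajectory_metrics(steps):
    """Compute step count, tool calls, redundancy, recovery, interventions."""
    seen = set()
    redundant = 0
    for step in steps:
        if step.tool is None:
            continue
        key = (step.tool, step.args)
        if key in seen:
            redundant += 1  # assumption: an exact repeat makes no progress
        seen.add(key)

    errors = sum(step.failed for step in steps)
    # Assumption: an error is recovered when the following step succeeds.
    recoveries = sum(
        1
        for current, following in zip(steps, steps[1:])
        if current.failed and not following.failed
    )
    return {
        "step_count": len(steps),
        "tool_calls": sum(step.tool is not None for step in steps),
        "redundant_actions": redundant,
        "recovery_rate": recoveries / errors if errors else None,
        "human_interventions": sum(step.human_help for step in steps),
    }


steps = [
    Step("search", ("agent safety",)),
    Step(),                                    # reasoning only
    Step("search", ("agent safety",)),         # exact repeat: redundant
    Step("open_page", ("https://example.com/a",), failed=True),
    Step("open_page", ("https://example.com/b",)),
]
metrics = trajectory_metrics(steps)
```

The recovery rate is reported as `None` when no errors occurred, since a ratio over zero errors is undefined rather than perfect.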
Failure Modes
Common Agent Failures
Definition: Agent Failure Modes
Agent failure modes are systematic ways agents can fail: (1) Hallucinated actions — calling non-existent tools, (2) Parameter errors — wrong arguments to tools, (3) Infinite loops — repeating the same failed action, (4) Goal drift — losing focus on the original task, (5) Safety violations — taking harmful actions.
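The first two failure modes can often be caught mechanically by checking each tool call against a registry of known tools. Here is a rough sketch, assuming actions are plain dictionaries and that `TOOL_SPECS` (a hypothetical registry invented for this example) lists each tool's required arguments and their types.

```python
# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SPECS = {
    "search": {"query": str},
    "send_email": {"to": str, "subject": str, "body": str},
}


def classify_action(action, tool_specs=TOOL_SPECS):
    """Return None if the call looks valid, else a failure category."""
    spec = tool_specs.get(action["tool"])
    if spec is None:
        # The agent called a tool that does not exist.
        return "hallucinated_action"

    args = action.get("args", {})
    for name, expected_type in spec.items():
        # Missing or wrongly typed required argument.
        if name not in args or not isinstance(args[name], expected_type):
            return "parameter_error"
    if set(args) - set(spec):
        # Arguments the tool does not accept.
        return "parameter_error"
    return None


print(classify_action({"tool": "search", "args": {"query": "weather"}}))
print(classify_action({"tool": "teleport", "args": {}}))
print(classify_action({"tool": "send_email", "args": {"to": "a@example.com"}}))
```

Loops, goal drift, and safety violations are properties of the whole trajectory rather than a single call, so they need separate checks.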
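
    class FailureAnalyzer:
        def __init__(self, available_tools, validate_params):
            # available_tools: names of the tools the agent may call.
            # validate_params: callable returning True when an action's
            # arguments are valid.
            self.available_tools = set(available_tools)
            self.validate_params = validate_params

        def analyze_failures(self, trajectories):
            """Categorize and analyze failure patterns."""
            failures = {
                "hallucinated_action": [],
                "parameter_error": [],
                "infinite_loop": [],
                "goal_drift": [],
                "safety_violation": [],
            }
            for traj in trajectories:
                # Action-level failures
                for action in traj.actions:
                    if action.tool not in self.available_tools:
                        failures["hallucinated_action"].append(action)
                    elif not self.validate_params(action):
                        failures["parameter_error"].append(action)
                # Trajectory-level failures (detect_* helpers not shown)
                if self.detect_loop(traj):
                    failures["infinite_loop"].append(traj)
                if self.detect_goal_drift(traj):
                    failures["goal_drift"].append(traj)
                if self.detect_safety_violation(traj):
                    failures["safety_violation"].append(traj)
            return failures
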
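The `detect_loop` helper is not shown above. One possible way to implement it, following the definition of an infinite loop as repeating the same failed action, is sketched here. It assumes each action is a dictionary with `tool`, `args`, and an `ok` flag, and the threshold of three repeats is an arbitrary choice for illustration.

```python
import json


def detect_loop(actions, max_repeats=3):
    """Return True if the same failed call occurs max_repeats times in a row."""
    run_length = 0
    previous_key = None
    for action in actions:
        if action["ok"]:
            # A successful action breaks any run of repeated failures.
            run_length, previous_key = 0, None
            continue
        # Serialize args with sorted keys so key order does not matter.
        key = (action["tool"], json.dumps(action["args"], sort_keys=True))
        run_length = run_length + 1 if key == previous_key else 1
        previous_key = key
        if run_length >= max_repeats:
            return True
    return False


stuck = [{"tool": "open_page", "args": {"url": "https://example.com"}, "ok": False}] * 3
varied = [
    {"tool": "open_page", "args": {"url": "https://example.com"}, "ok": False},
    {"tool": "search", "args": {"query": "example"}, "ok": True},
    {"tool": "open_page", "args": {"url": "https://example.com"}, "ok": False},
]
print(detect_loop(stuck))   # True
print(detect_loop(varied))  # False
```

This only catches exact repeats. An agent that alternates between two failing calls would slip through, so a production check might also look at cycles of length two or more.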
Safety Guardrails
Action Filtering
Definition: Action Guardrails
Action guardrails filter or block agent actions before they are executed. They enforce safety constraints like preventing destructive operations, limiting API access, and requiring confirmation for high-impact actions.

    class SafetyGuardrails:
        def __init__(self, config):
            self.blocked_actions = config.blocked_actions
            self.require_confirmation = config.require_confirmation
            self.rate_limits = config.rate_limits

        def check_action(self, action, context):
            """Check if an action is safe to execute."""
            # Check blocked actions
            if action.tool in self.blocked_actions:
                return False, f"Action '{action.tool}' is blocked"
            # Check rate limits
            if self.exceeds_rate_limit(action):
                return False, "Rate limit exceeded"
            # Check parameter safety
            if not self.validate_parameters(action):
                return False, "Invalid parameters"
            # Ask for confirmation last, so the user is only prompted
            # for actions that passed every automatic check
            if action.tool in self.require_confirmation:
                if not self.get_confirmation(action):
                    return False, "User denied confirmation"
            return True, "Action approved"

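The rate-limit check is left abstract in the class above. A sliding-window limiter is one common way to implement it. The sketch below is illustrative only: the class name `SlidingWindowRateLimiter` and its parameters are assumptions, and the clock is injectable so the behavior can be tested without waiting.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_calls per tool within any window_seconds span."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._calls = defaultdict(deque)  # tool name -> recent call times

    def allow(self, tool):
        """Record the call and return True, or return False if over the limit."""
        now = self.clock()
        calls = self._calls[tool]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
limiter = SlidingWindowRateLimiter(2, 10.0, clock=lambda: fake_now[0])
first = limiter.allow("send_email")    # True
second = limiter.allow("send_email")   # True
third = limiter.allow("send_email")    # False: two calls already in window
fake_now[0] = 10.0
fourth = limiter.allow("send_email")   # True: earlier calls have expired
```

Rejected calls are not recorded, so a burst of denied attempts does not extend the lockout. Whether that is desirable depends on how aggressively a runaway agent should be throttled.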
Sandboxing
Definition: Agent Sandboxing
Agent sandboxing executes agent actions in an isolated environment with limited permissions. This prevents agents from making irreversible changes to production systems.

    class AgentSandbox:
        def __init__(self, permissions):
            self.permissions = permissions
            self.action_log = []

        def execute(self, action):
            """Execute an action with sandbox restrictions."""
            # Check permissions
            if not self.has_permission(action):
                return {"error": "Permission denied"}
            # Execute in sandbox
            try:
                if action.tool == "file_write":
                    result = self.sandboxed_file_write(action.input)
                elif action.tool == "execute_code":
                    result = self.sandboxed_code_execution(action.input)
                elif action.tool == "api_call":
                    result = self.sandboxed_api_call(action.input)
                else:
                    # execute_tool is assumed to be defined elsewhere
                    result = execute_tool(action)
                self.action_log.append({"action": action, "result": result})
                return result
            except Exception as e:
                return {"error": str(e)}

        def sandboxed_code_execution(self, code):
            """Execute code with restrictions."""
            # No network access, limited file system, timeout.
            # RestrictedRunner is a placeholder for whatever isolation
            # layer you use (container, VM, restricted interpreter); it is
            # not the API of the RestrictedPython package.
            with RestrictedRunner(code, timeout=30) as sandbox:
                return sandbox.run()

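As a concrete starting point for the code-execution path, the sketch below runs untrusted code in a separate Python process with a timeout and a throwaway working directory, using only the standard library. The function name `run_untrusted_python` is an assumption for this example. This is not a real security boundary: the child process can still reach the network and read the file system, so stronger isolation (containers, VMs, or OS-level sandboxing) is still needed in practice.

```python
import subprocess
import sys
import tempfile


def run_untrusted_python(code, timeout_seconds=5):
    """Run code in a separate interpreter process with a timeout."""
    # A temporary working directory keeps relative file writes contained
    # and is deleted afterwards. Absolute paths are NOT blocked.
    with tempfile.TemporaryDirectory() as workdir:
        try:
            completed = subprocess.run(
                # -I runs Python in isolated mode: no user site-packages,
                # and PYTHON* environment variables are ignored.
                [sys.executable, "-I", "-c", code],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising.
            return {"error": f"Timed out after {timeout_seconds}s"}
    return {
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


ok_result = run_untrusted_python("print(2 + 2)")
slow_result = run_untrusted_python("while True: pass", timeout_seconds=1)
```

Returning a dictionary with an `error` key on timeout mirrors the error shape used by `AgentSandbox.execute`, so the two could be combined without special-casing.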
Red Teaming for Agents
Adversarial Testing

    import random

    class AgentRedTeam:
        def __init__(self, agent, attack_strategies):
            self.agent = agent
            self.attacks = attack_strategies

        def run_red_team(self, num_tests=100):
            """Run adversarial tests on the agent."""
            results = []
            for _ in range(num_tests):
                # Select attack strategy
                attack = random.choice(self.attacks)
                # Generate adversarial input
                adversarial_input = attack.generate_input()
                # Run agent
                trajectory = self.agent.run(adversarial_input)
                # Check for vulnerabilities
                vulnerabilities = self.check_vulnerabilities(trajectory)
                results.append({
                    "attack": attack.name,
                    "input": adversarial_input,
                    "vulnerabilities": vulnerabilities,
                    "severity": self.assess_severity(vulnerabilities),
                })
            return results

        def check_vulnerabilities(self, trajectory):
            """Check for security vulnerabilities in agent behavior."""
            # The detect_* helpers and assess_severity are not shown
            vulns = []
            for action in trajectory.actions:
                # Check for prompt injection
                if self.detect_prompt_injection(action):
                    vulns.append("prompt_injection")
                # Check for data exfiltration
                if self.detect_data_exfiltration(action):
                    vulns.append("data_exfiltration")
                # Check for privilege escalation
                if self.detect_privilege_escalation(action):
                    vulns.append("privilege_escalation")
            return vulns

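The detection helpers are not defined in the class above. One technique that could back `detect_data_exfiltration` is a canary token: plant a unique secret in the agent's context, then flag any outbound action whose arguments contain it. The sketch below assumes actions are dictionaries and that `send_email` and `api_call` are the outbound tools. Both are illustrative choices, not a fixed list.

```python
import json
import secrets


def make_canary(prefix="CANARY"):
    """Create a unique marker string to plant in sensitive context."""
    return f"{prefix}-{secrets.token_hex(8)}"


def find_canary_leaks(actions, canary, outbound_tools=("send_email", "api_call")):
    """Return indices of outbound actions whose arguments contain the canary."""
    leaks = []
    for index, action in enumerate(actions):
        if action["tool"] not in outbound_tools:
            continue
        # Serialize so nested argument values are searched as well.
        if canary in json.dumps(action["args"]):
            leaks.append(index)
    return leaks


canary = make_canary()
actions = [
    {"tool": "file_read", "args": {"path": "secrets.txt"}},
    {"tool": "send_email", "args": {"to": "x@example.com", "body": f"key: {canary}"}},
    {"tool": "api_call", "args": {"url": "https://example.com", "data": {"q": "hello"}}},
]
leaks = find_canary_leaks(actions, canary)  # only the email carries the canary
```

A plain substring match misses canaries that the agent encodes or splits across calls, so this catches careless leaks rather than a determined attacker.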
Evaluation Framework

    class AgentEvaluationFramework:
        def __init__(self, agent, config):
            self.agent = agent
            self.config = config
            self.evaluator = AgentEvaluator(config.test_suite)
            self.guardrails = SafetyGuardrails(config.safety)
            self.red_team = AgentRedTeam(agent, config.attacks)

        def full_evaluation(self):
            """Run comprehensive evaluation."""
            # Task performance
            performance = self.evaluator.evaluate(self.agent)
            # Safety testing
            safety = self.red_team.run_red_team()
            # Efficiency metrics (measure_efficiency and
            # compute_overall_score are not shown here)
            efficiency = self.measure_efficiency()
            return {
                "performance": performance,
                "safety": safety,
                "efficiency": efficiency,
                "overall_score": self.compute_overall_score(
                    performance, safety, efficiency
                ),
            }

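The framework leaves `compute_overall_score` undefined. One simple option is a weighted average of the three components, sketched below as a standalone function. The weights, the assumption that performance and efficiency are already numbers in [0, 1], and the choice to score safety as the fraction of red-team tests with no vulnerabilities are all illustrative, not a recommended standard.

```python
def compute_overall_score(success_rate, red_team_results, efficiency,
                          weights=(0.5, 0.3, 0.2)):
    """Weighted average of performance, safety, and efficiency in [0, 1]."""
    if red_team_results:
        # Safety score: share of adversarial tests that found nothing.
        clean = sum(1 for r in red_team_results if not r["vulnerabilities"])
        safety_score = clean / len(red_team_results)
    else:
        # Assumption: no red-team evidence earns no safety credit.
        safety_score = 0.0

    w_perf, w_safety, w_eff = weights
    total_weight = w_perf + w_safety + w_eff
    weighted = (w_perf * success_rate
                + w_safety * safety_score
                + w_eff * efficiency)
    return weighted / total_weight


# Nine clean red-team runs and one that exposed a prompt injection.
red_team_results = [{"vulnerabilities": []}] * 9 + [
    {"vulnerabilities": ["prompt_injection"]}
]
score = compute_overall_score(0.8, red_team_results, 0.7)
print(round(score, 2))  # 0.5*0.8 + 0.3*0.9 + 0.2*0.7 = 0.81
```

A single averaged number can hide a serious safety failure behind strong task performance, so it is worth reporting the components alongside it, or gating on a minimum safety score.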
Practice Exercises
- Task Evaluation: Design a test suite of 20 tasks for a web-browsing agent. Include simple lookups, multi-step research, and tasks requiring tool use.
- Failure Analysis: Analyze 50 agent trajectories to identify the most common failure modes. What percentage of failures are recoverable?
- Safety Guardrails: Implement guardrails for an agent that can send emails and access files. What actions should require confirmation?
- Red Teaming: Design 5 adversarial attacks against a code-writing agent. Test each attack and document vulnerabilities.
Key Takeaways
Summary: Agent Evaluation and Safety
- Task success rate measures whether agents complete their goals
- Trajectory evaluation examines the path taken, not just the outcome
- Common failures: hallucinated actions, parameter errors, infinite loops, goal drift, safety violations
- Action guardrails filter or block unsafe actions before execution
- Sandboxing isolates agent execution from production systems
- Red teaming tests agents against adversarial attacks
- Rate limiting prevents runaway agent behavior
- Confirmation requirements add human oversight for high-impact actions