Agent Evaluation and Safety

Agent Evaluation and Safety — Trust but Verify

Agents that take actions in the world require rigorous evaluation and safety measures. A single wrong action can have real consequences. This guide covers how to measure, test, and constrain agent behavior.

  • Success Metrics — Measuring task completion and quality
  • Failure Modes — Understanding how agents go wrong
  • Safety Guardrails — Constraining agent actions to prevent harm

With great power comes great responsibility.

LLM agents that interact with external systems pose unique safety challenges. Unlike pure text generation, agent actions can have irreversible consequences — deleting files, sending emails, making purchases, or accessing sensitive data.

Definition: Agent Evaluation

Agent evaluation measures an agent's ability to successfully complete tasks, its efficiency in doing so, and the safety of its actions. This includes task success rate, action accuracy, efficiency metrics, and safety compliance.

Evaluation Metrics

Task Success Metrics

Task Success Rate

$$SR = \frac{|\{\text{successfully completed tasks}\}|}{|\{\text{total tasks}\}|}$$

Here,

  • $SR$ = Success rate

Action Accuracy

$$AA = \frac{|\{\text{correct actions}\}|}{|\{\text{total actions}\}|}$$

Here,

  • $AA$ = Action accuracy

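As a worked example of the two ratios above, the sketch below computes SR and AA from a small hand-made result set. The record layout (`success`, `correct_actions`, `total_actions`) and the function names are assumptions made for illustration; they are not part of the lesson's evaluator.

```python
def success_rate(results):
    """SR = successfully completed tasks / total tasks."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["success"]) / len(results)


def action_accuracy(results):
    """AA = correct actions / total actions, pooled over all tasks."""
    total = sum(r["total_actions"] for r in results)
    if total == 0:
        return 0.0
    return sum(r["correct_actions"] for r in results) / total


# Hypothetical results for four tasks.
results = [
    {"success": True, "correct_actions": 4, "total_actions": 5},
    {"success": True, "correct_actions": 3, "total_actions": 3},
    {"success": False, "correct_actions": 1, "total_actions": 4},
    {"success": True, "correct_actions": 6, "total_actions": 8},
]

print(success_rate(results))     # 3 of 4 tasks succeeded
print(action_accuracy(results))  # 14 of 20 actions were correct
```

Pooling actions across tasks weights long trajectories more heavily; averaging per-task accuracy instead is an equally reasonable choice and gives a different number.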
class AgentEvaluator:
    def __init__(self, test_suite):
        self.test_suite = test_suite
    
    def evaluate(self, agent):
        results = []
        for task in self.test_suite:
            # Run agent on task
            trajectory = agent.run(task)
            
            # Evaluate success
            success = self.check_success(task, trajectory)
            
            # Evaluate efficiency
            efficiency = self.measure_efficiency(trajectory)
            
            # Evaluate safety
            safety = self.check_safety(trajectory)
            
            results.append({
                "task": task,
                "success": success,
                "efficiency": efficiency,
                "safety": safety,
                "steps": len(trajectory),
                "tools_used": self.get_tools_used(trajectory)
            })
        
        return self.summarize_results(results)
    
    def check_success(self, task, trajectory):
        """Check if the task was completed successfully."""
        if not trajectory:
            # The agent took no steps, so there is no final state to check
            return False
        final_state = trajectory[-1]
        return task.goal_reached(final_state)
    
    def measure_efficiency(self, trajectory):
        """Measure how efficiently the task was completed."""
        total_steps = len(trajectory)
        redundant_steps = self.count_redundant_steps(trajectory)
        return 1.0 - (redundant_steps / max(total_steps, 1))

Trajectory Metrics

Definition: Trajectory Evaluation

Trajectory evaluation examines the sequence of actions taken by the agent, not just the final outcome. This reveals inefficient patterns, unnecessary detours, and potential safety issues.

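| Metric              | Description                             | Ideal             |
|---------------------|-----------------------------------------|-------------------|
| Step count          | Total actions taken                     | Minimum necessary |
| Tool calls          | Number of external tool invocations     | Minimal           |
| Redundant actions   | Actions that don't progress toward goal | 0                 |
| Recovery rate       | Successful recoveries from errors       | High              |
| Human interventions | Times human help was needed             | 0                 |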
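One way to compute these metrics from a recorded trajectory is sketched below. The step format (a dict with `tool`, `args`, `error`, and `human` keys) is an assumption, as are the two working definitions: a redundant action is an exact repeat of a call that already succeeded, and an error counts as recovered when the very next step does not fail.

```python
def trajectory_metrics(steps):
    """Compute the table's metrics for one trajectory.

    Each step is assumed to be a dict with keys:
      tool (str or None), args (dict of hashable values),
      error (bool), human (bool).
    """
    seen = set()
    redundant = 0
    for s in steps:
        if s["tool"] is None:
            continue  # reasoning-only step, not a tool call
        key = (s["tool"], tuple(sorted(s["args"].items())))
        if key in seen:
            redundant += 1  # repeats a call that already succeeded
        if not s["error"]:
            seen.add(key)  # retries after a failure are not redundant

    errors = [i for i, s in enumerate(steps) if s["error"]]
    recovered = sum(
        1 for i in errors if i + 1 < len(steps) and not steps[i + 1]["error"]
    )
    return {
        "step_count": len(steps),
        "tool_calls": sum(1 for s in steps if s["tool"] is not None),
        "redundant_actions": redundant,
        "recovery_rate": recovered / len(errors) if errors else None,
        "human_interventions": sum(1 for s in steps if s["human"]),
    }


# Hypothetical five-step trajectory.
steps = [
    {"tool": "search", "args": {"q": "weather"}, "error": False, "human": False},
    {"tool": "open", "args": {"url": "x"}, "error": True, "human": False},
    {"tool": "open", "args": {"url": "x"}, "error": False, "human": False},
    {"tool": None, "args": {}, "error": False, "human": True},
    {"tool": "search", "args": {"q": "weather"}, "error": False, "human": False},
]
metrics = trajectory_metrics(steps)
print(metrics)
```

Here the second `open` call is a retry after a failure, so it is not counted as redundant, while the repeated `search` is.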

Failure Modes

Common Agent Failures

Definition: Agent Failure Modes

Agent failure modes are systematic ways agents can fail: (1) Hallucinated actions — calling non-existent tools, (2) Parameter errors — wrong arguments to tools, (3) Infinite loops — repeating the same failed action, (4) Goal drift — losing focus on the original task, (5) Safety violations — taking harmful actions.

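class FailureAnalyzer:
    def __init__(self, available_tools, validate_params):
        # Names of the tools the agent is allowed to call
        self.available_tools = set(available_tools)
        # Callable that returns True when an action's arguments are valid
        self.validate_params = validate_params

    def analyze_failures(self, trajectories):
        """Categorize and analyze failure patterns."""
        failures = {
            "hallucinated_action": [],
            "parameter_error": [],
            "infinite_loop": [],
            "goal_drift": [],
            "safety_violation": []
        }
        
        for traj in trajectories:
            for action in traj.actions:
                if action.tool not in self.available_tools:
                    failures["hallucinated_action"].append(action)
                elif not self.validate_params(action):
                    failures["parameter_error"].append(action)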
            
            if self.detect_loop(traj):
                failures["infinite_loop"].append(traj)
            
            if self.detect_goal_drift(traj):
                failures["goal_drift"].append(traj)
            
            if self.detect_safety_violation(traj):
                failures["safety_violation"].append(traj)
        
        return failures
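The `detect_loop` helper is not shown in the lesson. A minimal sketch, assuming a trajectory can be reduced to a list of `(tool, args)` pairs and that "loop" means the same call repeated back to back, might look like this; the `max_repeats` threshold is a hypothetical parameter.

```python
def detect_loop(actions, max_repeats=3):
    """Return True if the same (tool, args) call occurs max_repeats times in a row."""
    run = 0
    previous = None
    for tool, args in actions:
        current = (tool, tuple(sorted(args.items())))
        run = run + 1 if current == previous else 1
        if run >= max_repeats:
            return True
        previous = current
    return False


stuck = [("click", {"id": "submit"})] * 3
varied = [("click", {"id": "submit"}), ("scroll", {"dy": 200}), ("click", {"id": "submit"})]
print(detect_loop(stuck))   # three identical calls in a row
print(detect_loop(varied))  # the repeated call is interrupted by another action
```

This only catches immediate repetition. Longer cycles such as A, B, A, B would need a check over a sliding window of recent actions.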

Safety Guardrails

Action Filtering

Definition: Action Guardrails

Action guardrails filter or block agent actions before they are executed. They enforce safety constraints like preventing destructive operations, limiting API access, and requiring confirmation for high-impact actions.

class SafetyGuardrails:
    def __init__(self, config):
        self.blocked_actions = config.blocked_actions
        self.require_confirmation = config.require_confirmation
        self.rate_limits = config.rate_limits
    
    def check_action(self, action, context):
        """Check if an action is safe to execute."""
        # Check blocked actions
        if action.tool in self.blocked_actions:
            return False, f"Action '{action.tool}' is blocked"
        
        # Check confirmation requirements
        if action.tool in self.require_confirmation:
            if not self.get_confirmation(action):
                return False, "User denied confirmation"
        
        # Check rate limits
        if self.exceeds_rate_limit(action):
            return False, "Rate limit exceeded"
        
        # Check parameter safety
        if not self.validate_parameters(action):
            return False, "Invalid parameters"
        
        return True, "Action approved"
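The `exceeds_rate_limit` check above is left undefined. One common way to back it is a sliding-window limiter per tool, sketched here. The class name, its parameters, and the injectable `clock` are assumptions for illustration rather than the lesson's implementation.

```python
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most max_calls per tool within any window_seconds interval."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self.calls = {}     # tool name -> deque of call timestamps

    def allow(self, tool):
        """Record the call and return True, or return False if over the limit."""
        now = self.clock()
        timestamps = self.calls.setdefault(tool, deque())
        # Drop timestamps that have aged out of the window
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_calls:
            return False
        timestamps.append(now)
        return True


# Usage with a fake clock: 2 calls allowed per 10 seconds.
fake_time = [0.0]
limiter = SlidingWindowRateLimiter(2, 10, clock=lambda: fake_time[0])
first, second, third = (limiter.allow("send_email") for _ in range(3))
fake_time[0] = 10.0
after_window = limiter.allow("send_email")
print(first, second, third, after_window)
```

With such a limiter, `exceeds_rate_limit(action)` could be as simple as `not limiter.allow(action.tool)`. Note that `allow` both checks and records the call, so it should be invoked once per action.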

Sandboxing

Definition: Agent Sandboxing

Agent sandboxing executes agent actions in an isolated environment with limited permissions. This prevents agents from making irreversible changes to production systems.

import subprocess
import sys


class AgentSandbox:
    def __init__(self, permissions):
        self.permissions = permissions
        self.action_log = []
    
    def execute(self, action):
        """Execute an action with sandbox restrictions."""
        # Check permissions
        if not self.has_permission(action):
            return {"error": "Permission denied"}
        
        # Execute in sandbox
        try:
            if action.tool == "file_write":
                result = self.sandboxed_file_write(action.input)
            elif action.tool == "execute_code":
                result = self.sandboxed_code_execution(action.input)
            elif action.tool == "api_call":
                result = self.sandboxed_api_call(action.input)
            else:
                # execute_tool is assumed to be defined elsewhere; other
                # permitted tools run without extra isolation
                result = execute_tool(action)
            
            self.action_log.append({"action": action, "result": result})
            return result
        except Exception as e:
            # Includes subprocess.TimeoutExpired from the code runner below
            return {"error": str(e)}
    
    def sandboxed_code_execution(self, code):
        """Execute code in a separate interpreter with a timeout."""
        # Assumption: a child process stands in for the sandbox here. On its
        # own it does NOT block network or file system access; real isolation
        # needs a container, VM, or OS-level sandbox around this call.
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True, text=True, timeout=30,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr,
                "returncode": proc.returncode}
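The `sandboxed_file_write` helper is also not shown. A minimal sketch of one piece of it, confining writes to a sandbox root directory, is given below. The function signature and return value are assumptions, and path confinement alone is not a full sandbox.

```python
import tempfile
from pathlib import Path


def sandboxed_file_write(root, relative_path, content):
    """Write content under root, refusing any path that resolves outside it."""
    root = Path(root).resolve()
    target = (root / relative_path).resolve()
    if not target.is_relative_to(root):  # requires Python 3.9+
        raise PermissionError(f"{relative_path!r} escapes the sandbox root")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return {"path": str(target), "bytes": len(content.encode("utf-8"))}


# Usage: one write inside the root, one attempted escape.
with tempfile.TemporaryDirectory() as sandbox_root:
    ok = sandboxed_file_write(sandbox_root, "notes/a.txt", "hello")
    print(ok["bytes"])
    try:
        sandboxed_file_write(sandbox_root, "../escape.txt", "nope")
    except PermissionError as exc:
        print("blocked:", exc)
```

Resolving the path before the check handles `..` segments and absolute paths. It does not protect against a symlink being swapped in between the check and the write, which is one reason OS-level isolation is still needed.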

Red Teaming for Agents

Adversarial Testing

import random


class AgentRedTeam:
    def __init__(self, agent, attack_strategies):
        self.agent = agent
        self.attacks = attack_strategies
    
    def run_red_team(self, num_tests=100):
        """Run adversarial tests on the agent."""
        results = []
        
        for i in range(num_tests):
            # Select attack strategy
            attack = random.choice(self.attacks)
            
            # Generate adversarial input
            adversarial_input = attack.generate_input()
            
            # Run agent
            trajectory = self.agent.run(adversarial_input)
            
            # Check for vulnerabilities
            vulnerabilities = self.check_vulnerabilities(trajectory)
            
            results.append({
                "attack": attack.name,
                "input": adversarial_input,
                "vulnerabilities": vulnerabilities,
                "severity": self.assess_severity(vulnerabilities)
            })
        
        return results
    
    def check_vulnerabilities(self, trajectory):
        """Check for security vulnerabilities in agent behavior."""
        vulns = []
        
        for action in trajectory.actions:
            # Check for prompt injection
            if self.detect_prompt_injection(action):
                vulns.append("prompt_injection")
            
            # Check for data exfiltration
            if self.detect_data_exfiltration(action):
                vulns.append("data_exfiltration")
            
            # Check for privilege escalation
            if self.detect_privilege_escalation(action):
                vulns.append("privilege_escalation")
        
        return vulns
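The list returned by `run_red_team` is easier to act on once it is aggregated per attack strategy. The sketch below assumes only the `attack` and `vulnerabilities` keys produced above; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize_red_team(results):
    """Fraction of runs per attack that surfaced at least one vulnerability."""
    runs = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        runs[r["attack"]] += 1
        if r["vulnerabilities"]:
            hits[r["attack"]] += 1
    return {name: hits[name] / runs[name] for name in runs}


# Hypothetical results from four red-team runs.
sample = [
    {"attack": "injection", "vulnerabilities": ["prompt_injection"]},
    {"attack": "injection", "vulnerabilities": []},
    {"attack": "exfiltration", "vulnerabilities": []},
    {"attack": "exfiltration", "vulnerabilities": []},
]
summary = summarize_red_team(sample)
print(summary)
```

Because attacks are drawn with `random.choice`, per-attack run counts will be uneven, so rates for rarely sampled attacks should be read with caution.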

Evaluation Framework

class AgentEvaluationFramework:
    def __init__(self, agent, config):
        self.agent = agent
        self.config = config
        self.evaluator = AgentEvaluator(config.test_suite)
        self.guardrails = SafetyGuardrails(config.safety)
        self.red_team = AgentRedTeam(agent, config.attacks)
    
    def full_evaluation(self):
        """Run comprehensive evaluation."""
        # Task performance
        performance = self.evaluator.evaluate(self.agent)
        
        # Safety testing
        safety = self.red_team.run_red_team()
        
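        # Efficiency metrics (measure_efficiency and compute_overall_score
        # are framework-level helpers not shown in this lesson)
        efficiency = self.measure_efficiency()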
        
        return {
            "performance": performance,
            "safety": safety,
            "efficiency": efficiency,
            "overall_score": self.compute_overall_score(
                performance, safety, efficiency
            )
        }
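The lesson does not define `compute_overall_score`. One possible design is a weighted average with a safety gate, sketched below as a standalone function. It assumes the three inputs have already been reduced to scalars in [0, 1]; the weights and the `safety_floor` threshold are hypothetical values, not recommendations.

```python
def compute_overall_score(performance, safety, efficiency,
                          weights=(0.5, 0.3, 0.2), safety_floor=0.8):
    """Weighted average of three scores in [0, 1], gated on a minimum safety score."""
    scores = (performance, safety, efficiency)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("all component scores must lie in [0, 1]")
    if safety < safety_floor:
        # Gate: strong task performance cannot offset an unsafe agent
        return 0.0
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


print(compute_overall_score(0.9, 0.9, 0.5))  # passes the safety gate
print(compute_overall_score(1.0, 0.5, 1.0))  # fails the safety gate
```

The gate reflects a design choice: safety is treated as a precondition rather than one factor to trade against the others.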

Practice Exercises

  1. Task Evaluation: Design a test suite of 20 tasks for a web-browsing agent. Include simple lookups, multi-step research, and tasks requiring tool use.

  2. Failure Analysis: Analyze 50 agent trajectories to identify the most common failure modes. What percentage of failures are recoverable?

  3. Safety Guardrails: Implement guardrails for an agent that can send emails and access files. What actions should require confirmation?

  4. Red Teaming: Design 5 adversarial attacks against a code-writing agent. Test each attack and document vulnerabilities.

Key Takeaways

Summary: Agent Evaluation and Safety

  • Task success rate measures whether agents complete their goals
  • Trajectory evaluation examines the path taken, not just the outcome
  • Common failures: hallucinated actions, parameter errors, infinite loops, goal drift, safety violations
  • Action guardrails filter or block unsafe actions before execution
  • Sandboxing isolates agent execution from production systems
  • Red teaming tests agents against adversarial attacks
  • Rate limiting caps how often an agent can act, which contains runaway behavior
  • Confirmation requirements add human oversight for high-impact actions

What to Learn Next

-> LLM Safety and Red Teaming: Safety testing for language models.

-> LLM Agent Frameworks: Building autonomous agents with LLMs.

-> Constitutional AI: Using AI feedback for alignment.

-> Tool Use and Function Calling: Teaching LLMs to use external tools.

-> Building Production LLM Applications: End-to-end production systems.

-> ML Ethics: Ethical considerations in ML systems.
