Wall-E Orchestration Patterns (Optum)

Patterns and guardrails for composing safe multi-agent workflows in Wall-E (Wide Array Large Language Engine), Optum's enterprise AI orchestration platform.

Status: experimental

IDE: claude, codex, vscode

Version: 1.0.0

Owner: epic-platform-sre

Tags: wall-e, orchestration, multi-agent, safety, optum

Overview

Wall-E (Wide Array Large Language Engine) is Optum's multi-agent orchestration platform. This guide covers patterns for composing safe, effective agent workflows.

Architecture Context

Wall-E consists of three layers:

┌─────────────────────────────────────────────────────────────┐
│                  CONSUMPTION METHODS                        │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐    │
│   │ Portal  │   │ Widget  │   │   API   │   │ Copilot │    │
│   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘    │
└────────┼─────────────┼─────────────┼─────────────┼──────────┘
         │             │             │             │
┌────────┴─────────────┴─────────────┴─────────────┴──────────┐
│                    BACKEND LAYER                            │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│   │ FastAPI  │  │  Redis   │  │  Celery  │  │ Postgres │   │
│   └──────────┘  └──────────┘  └──────────┘  └──────────┘   │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────┴───────────────────────────────────┐
│                    MCP SERVERS                              │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│   │ServiceNow│  │   Ignis  │  │    SSH   │  │ Victoria │   │
│   └──────────┘  └──────────┘  └──────────┘  │ Metrics  │   │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  └──────────┘   │
│   │ InfraRED │  │ NetScout │  │  Custom  │                 │
│   └──────────┘  └──────────┘  └──────────┘                 │
└─────────────────────────────────────────────────────────────┘

Agent Contract Pattern

MUST define an agent contract for every agent:

# agent-contract.yaml
agent:
  name: incident-triage-agent
  version: '1.0.0'
  owner: platform-sre-team

  # Domain boundaries
  domain:
    primary: 'incident-management'
    secondary: ['monitoring', 'alerting']

  # Risk classification
  risk:
    tier: 2 # 1=low, 2=medium, 3=high, 4=critical
    data_classification: internal
    phi_access: false
    pii_access: false

  # Available capabilities
  capabilities:
    tools:
      - name: get-incident-details
        mcp_server: servicenow
        risk_level: read-only
      - name: get-metrics
        mcp_server: victoria-metrics
        risk_level: read-only
      - name: add-incident-comment
        mcp_server: servicenow
        risk_level: low-risk-write
      - name: update-incident-priority
        mcp_server: servicenow
        risk_level: high-risk-write
        requires_approval: true

    resources:
      - uri_pattern: 'servicenow://incident/*'
        mcp_server: servicenow
      - uri_pattern: 'victoria://metrics/*'
        mcp_server: victoria-metrics

  # Approval requirements
  approvals:
    human_in_loop:
      - action: update-incident-priority
        condition: always
      - action: close-incident
        condition: always
      - action: execute-remediation
        condition: when_production

  # Escalation path
  escalation:
    primary: '#platform-sre-oncall'
    secondary: '[email protected]'
    pager: 'platform-sre-pd'
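
The approval rules in a contract can be checked mechanically before an agent is registered. The following is a minimal sketch, not Wall-E's actual validator: the function name `find_contract_violations` and the dict layout (the YAML above after parsing) are assumptions made for illustration.

```python
def find_contract_violations(contract: dict) -> list[str]:
    """Return human-readable problems found in a parsed agent contract."""
    agent = contract["agent"]
    violations = []

    # Actions that the contract routes through a human approver
    gated = {
        rule["action"]
        for rule in agent.get("approvals", {}).get("human_in_loop", [])
    }

    for tool in agent.get("capabilities", {}).get("tools", []):
        name = tool["name"]
        if tool.get("risk_level") != "high-risk-write":
            continue
        if not tool.get("requires_approval", False):
            violations.append(f"{name}: high-risk-write tool lacks requires_approval")
        if name not in gated:
            violations.append(f"{name}: not listed under approvals.human_in_loop")

    if agent.get("risk", {}).get("tier") not in (1, 2, 3, 4):
        violations.append("risk.tier must be an integer from 1 to 4")
    return violations


# Hypothetical parsed contract, trimmed to the fields the check reads
contract = {
    "agent": {
        "name": "incident-triage-agent",
        "risk": {"tier": 2},
        "capabilities": {
            "tools": [
                {"name": "get-incident-details", "risk_level": "read-only"},
                {"name": "update-incident-priority",
                 "risk_level": "high-risk-write",
                 "requires_approval": True},
            ]
        },
        "approvals": {
            "human_in_loop": [
                {"action": "update-incident-priority", "condition": "always"},
            ]
        },
    }
}

print(find_contract_violations(contract))
```

Running a check like this in CI would catch a high-risk tool that was added to a contract without a matching approval rule.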

Specialist Agent Pattern

PREFER specialist agents with narrow scopes over generalists:

# ❌ BAD: Generalist agent with broad scope
agent:
  name: everything-agent
  capabilities:
    - incident-management
    - change-management
    - infrastructure
    - security
    - compliance

# ✅ GOOD: Specialist agents with focused domains
agents:
  - name: incident-triage-agent
    domain: incident-management
    expertise:
      - symptom-analysis
      - impact-assessment
      - runbook-matching

  - name: metrics-analyst-agent
    domain: observability
    expertise:
      - metric-correlation
      - anomaly-detection
      - trend-analysis

  - name: remediation-advisor-agent
    domain: incident-remediation
    expertise:
      - runbook-execution
      - safe-command-generation
      - rollback-planning

Implementation Technology

Wall-E uses pydantic-graph for workflow orchestration and pydantic-ai for agent implementation. Agents call Azure OpenAI and authenticate through Stargate.

Core Dependencies

Version Compatibility:

pydantic-graph: ^0.3.0 (for BaseNode, Graph, Edge)
pydantic-ai: ^0.0.14 (for Agent, MCPServerStreamableHTTP)
pydantic: ^2.9.0 (for BaseModel, Field)
openai: ^1.54.0 (for Azure OpenAI client)

These packages are under active development. Pin to specific versions in production and test thoroughly before upgrading.
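
One way to catch an accidental upgrade is a startup check that compares installed versions against the caret ranges listed above. This is a sketch under stated assumptions: `satisfies_caret` and `check_pins` are hypothetical helpers, the base versions are copied from the compatibility list, and only plain numeric versions (no pre-release tags) are handled.

```python
from importlib import metadata

# Base versions of the caret ranges from the compatibility list
PINS = {
    "pydantic-graph": "0.3.0",
    "pydantic-ai": "0.0.14",
    "pydantic": "2.9.0",
    "openai": "1.54.0",
}


def parse(version: str) -> tuple[int, ...]:
    """Split a plain numeric version such as '2.9.0' into integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies_caret(installed: str, base: str) -> bool:
    """True when `installed` falls inside the caret range ^base."""
    inst, req = parse(installed), parse(base)
    if inst < req:
        return False
    # A caret range locks every component up to the first non-zero one
    for index, part in enumerate(req):
        if part != 0:
            return inst[: index + 1] == req[: index + 1]
    return inst == req


def check_pins(pins: dict, lookup=metadata.version) -> dict:
    """Map each offending package to a short description of the problem."""
    problems = {}
    for package, base in pins.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if not satisfies_caret(installed, base):
            problems[package] = f"{installed} is outside ^{base}"
    return problems
```

The `lookup` parameter exists so the check can be exercised with a fake version table in tests instead of the real environment.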

# Wall-E Core Stack
# Package versions are listed under "Version Compatibility" above;
# refer to the Wall-E deployment requirements.txt for the exact pins.
from pydantic_graph import BaseNode, GraphRunContext, End, Graph
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.azure import AzureProvider
from pydantic import BaseModel, Field
from dataclasses import dataclass, field

State Management Pattern

MUST use dataclass-based state with separate namespaces:

from dataclasses import dataclass, field
from typing import Any

from pydantic_ai.messages import ModelMessage

@dataclass
class WorkflowState:
    """State shared across all workflow nodes."""

    # User input namespace
    user: dict = field(default_factory=dict)
    # Agent output namespace
    agent: dict = field(default_factory=dict)
    # Temporary buffer namespace
    buffer: dict = field(default_factory=dict)
    # Message history for context
    message_history: list[ModelMessage] = field(default_factory=list)

    def set_user_input(self, key: str, value: Any) -> None:
        self.user[key] = value

    def get_agent_output(self, key: str) -> Any:
        return self.agent.get(key)

Node Implementation Pattern

MUST implement nodes using BaseNode with typed return annotations:

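from dataclasses import dataclass
from typing import Annotated

from pydantic import ValidationError
from pydantic_ai import format_as_xml
from pydantic_graph import BaseNode, GraphRunContext, End, Edge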

class BaseWorkflowNode(BaseNode[WorkflowState]):
    """Extended BaseNode with validation and docstrings."""

    docstring_notes = True  # Include docstrings in graph visualization
    validation_schema = None  # Optional Pydantic schema for validation

    async def run(self, ctx: GraphRunContext[WorkflowState]):
        raise NotImplementedError("Subclasses MUST implement run()")

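    def validate(self, data) -> list:
        """Validate data against schema if defined; return a list of errors."""
        if self.validation_schema:
            try:
                # Pydantic v2 API; BaseModel.validate() is deprecated
                self.validation_schema.model_validate(data)
                return []
            except ValidationError as e:
                # FieldValidationError is a Wall-E error type defined elsewhere
                return [FieldValidationError(msg=str(err["msg"]),
                                            input=str(err["input"]))
                        for err in e.errors()]
        return []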

@dataclass
class AnalyzeRequest(BaseWorkflowNode):
    """Analyze the incoming request and classify intent."""

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
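    ) -> (
        # "A" | "B" on bare strings raises TypeError, so annotate each branch
        Annotated["ExecutePlan", Edge(label="classified")]
        | Annotated["RequestClarification", Edge(label="classified")]
    ):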
        # Analyze request using agent
        result = await analyze_agent.run(
            format_as_xml({"request": ctx.state.user.get("request")})
        )

        if result.data.needs_clarification:
            return RequestClarification()

        ctx.state.agent["analysis"] = result.data
        return ExecutePlan()

Workflow Composition Patterns

Diagnose → Propose → Execute → Verify Pattern

MUST structure workflows in four phases:

workflow:
  name: incident-remediation
  phases:
    - name: diagnose
      agents: [incident-triage-agent, metrics-analyst-agent]
      actions:
        - gather-incident-context
        - analyze-metrics
        - identify-affected-systems
        - correlate-recent-changes
      outputs:
        - diagnosis-summary
        - affected-components
        - probable-causes

    - name: propose
      agents: [remediation-advisor-agent]
      inputs: [diagnosis-summary, affected-components]
      actions:
        - match-runbooks
        - generate-remediation-options
        - assess-risk-per-option
        - recommend-approach
      outputs:
        - remediation-plan
        - risk-assessment
        - rollback-plan
      gates:
        - type: human-approval
          approver: on-call-engineer
          timeout: 15m

    - name: execute
      agents: [remediation-executor-agent]
      inputs: [remediation-plan, approval-token]
      preconditions:
        - approval-received
        - rollback-plan-validated
      actions:
        - execute-remediation-steps
        - monitor-execution
        - capture-outputs
      outputs:
        - execution-results
        - step-by-step-log

    - name: verify
      agents: [verification-agent, metrics-analyst-agent]
      inputs: [execution-results, original-symptoms]
      actions:
        - verify-symptoms-resolved
        - confirm-metrics-normal
        - validate-no-side-effects
      outputs:
        - verification-report
        - incident-resolution-status
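
Because each phase consumes the outputs of earlier phases, a workflow definition can be linted for inputs that nothing produces. The sketch below is illustrative only: `unresolved_inputs` is a hypothetical helper, and it assumes that `approval-token` is issued by the human-approval gate and that `original-symptoms` comes from the triggering incident, since neither appears as a phase output above.

```python
def unresolved_inputs(phases: list, external: frozenset = frozenset()) -> dict:
    """Map each phase name to the inputs that no earlier phase produced."""
    available = set(external)
    missing = {}
    for phase in phases:
        lacking = [name for name in phase.get("inputs", []) if name not in available]
        if lacking:
            missing[phase["name"]] = lacking
        # Outputs only become available to phases that run later
        available.update(phase.get("outputs", []))
    return missing


# Inputs and outputs copied from the incident-remediation workflow
PHASES = [
    {"name": "diagnose",
     "outputs": ["diagnosis-summary", "affected-components", "probable-causes"]},
    {"name": "propose",
     "inputs": ["diagnosis-summary", "affected-components"],
     "outputs": ["remediation-plan", "risk-assessment", "rollback-plan"]},
    {"name": "execute",
     "inputs": ["remediation-plan", "approval-token"],
     "outputs": ["execution-results", "step-by-step-log"]},
    {"name": "verify",
     "inputs": ["execution-results", "original-symptoms"],
     "outputs": ["verification-report", "incident-resolution-status"]},
]

# Values supplied from outside the phase chain (assumed, see above)
EXTERNAL = frozenset({"approval-token", "original-symptoms"})

print(unresolved_inputs(PHASES, EXTERNAL))
```

Declaring external inputs explicitly keeps the check strict: a typo in a phase input shows up as an unresolved name instead of passing silently.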

Orchestrator Pattern (DocX)

Wall-E uses DocX as the primary orchestrator agent. Here is the actual implementation pattern:

from dataclasses import dataclass
from typing import Annotated

from pydantic_ai import format_as_xml
from pydantic_graph import Edge, Graph, GraphRunContext

@dataclass
class StartNode(BaseWorkflowNode):
    """Entry point - initialize workflow state."""

    async def run(self, ctx: GraphRunContext[WorkflowState]) -> "GreetingNode":
        ctx.state.user["session_id"] = generate_session_id()
        return GreetingNode()

@dataclass
class GreetingNode(BaseWorkflowNode):
    """Greet user and collect initial input."""

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> Annotated["AskQuestion", Edge(label="greeted")]:
        ctx.state.agent["greeting"] = "How can I help you today?"
        return AskQuestion()

@dataclass
class AskQuestion(BaseWorkflowNode):
    """Collect and validate user question."""
    validation_schema = AskQuestionSchema

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
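    ) -> (
        # Bare strings cannot be combined with |, so annotate each branch
        Annotated["Evaluate", Edge(label="question_received")]
        | Annotated["Reprimand", Edge(label="question_received")]
    ):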
        question = ctx.state.user.get("question")
        errors = self.validate({"question": question})

        if errors:
            ctx.state.agent["error"] = errors
            return Reprimand()

        return Evaluate()

@dataclass
class Evaluate(BaseWorkflowNode):
    """Evaluate if question is valid and actionable."""

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> (
        # Each outgoing edge carries its own label
        Annotated["GenerateSQL", Edge(label="Good Question")]
        | Annotated["Reprimand", Edge(label="Bad Question")]
    ):
        result = await evaluate_agent.run(
            format_as_xml({"question": ctx.state.user.get("question")})
        )

        if result.data.correct:
            return GenerateSQL()
        else:
            ctx.state.agent["reprimand_reason"] = result.data.reason
            return Reprimand()

# Build the graph
docx_graph = Graph(
    nodes=[StartNode, GreetingNode, AskQuestion, Evaluate,
           GenerateSQL, ChooseQuery, FetchData, ContextNode,
           EvalNode, RunAgain, Reprimand],
    state_type=WorkflowState
)

Graph Execution Pattern

MUST execute graphs with proper error handling:

import logging

logger = logging.getLogger(__name__)

async def run_workflow(user_input: str, history: list[ModelMessage] | None = None):
    """Execute a Wall-E workflow with state management."""

    # Initialize state
    state = WorkflowState()
    state.user["request"] = user_input
    if history:
        state.message_history = history

    previous = "start"
    try:
        # Graph.iter() yields one node per step and finishes with End
        async with docx_graph.iter(
            StartNode(),
            state=state,
            deps=WorkflowDeps()
        ) as graph_run:
            async for node in graph_run:
                # Log node transitions for observability
                current = type(node).__name__
                logger.info("Transition: %s → %s", previous, current)
                previous = current
    except Exception:
        # Record the failing step with its traceback, then re-raise
        logger.exception("Workflow failed after node %s", previous)
        raise

    return state.agent.get("result")

MCP Integration Pattern

MUST proxy MCP requests through authenticated endpoints:

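from fastapi import APIRouter, HTTPException, Request, Response
from httpx import AsyncClient

router = APIRouter()

# httpx has already decoded the upstream body, so these headers would
# describe the wrong payload if copied to the client verbatim
EXCLUDED_RESPONSE_HEADERS = {
    "content-encoding", "content-length", "transfer-encoding", "connection",
}

@router.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE"])
async def proxy_mcp_request(request: Request, path: str):
    """Proxy requests to MCP servers with authentication."""

    # Get auth token from request; reject unauthenticated calls
    auth_header = request.headers.get("Authorization")
    if not auth_header:
        raise HTTPException(status_code=401, detail="Missing Authorization header")

    # Build target URL (MCP_SERVER_BASE_URL is defined elsewhere)
    target_url = f"{MCP_SERVER_BASE_URL}/{path}"

    # Forward auth, plus Content-Type so the upstream can parse the body
    forward_headers = {"Authorization": auth_header}
    if "content-type" in request.headers:
        forward_headers["Content-Type"] = request.headers["content-type"]

    async with AsyncClient() as client:
        response = await client.request(
            method=request.method,
            url=target_url,
            headers=forward_headers,
            content=await request.body(),
            params=request.query_params
        )

    return Response(
        content=response.content,
        status_code=response.status_code,
        headers={
            key: value for key, value in response.headers.items()
            if key.lower() not in EXCLUDED_RESPONSE_HEADERS
        }
    )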

MCP Server Implementation Pattern

FastMCP Server Pattern

MUST implement MCP servers using FastMCP:

from fastmcp import FastMCP

instructions = """
This MCP server provides tools for querying ServiceNow incidents.
Available tools:
- fetch_incidents: Retrieve incidents by site code
"""

mcp = FastMCP(
    name="ServiceNow MCP Server",
    version="1.0.0",
    instructions=instructions,
)

@mcp.tool()
def fetch_incidents(site_code: str | None = None) -> list[dict]:
    """
    Fetch active incidents from ServiceNow.

    Args:
        site_code: Optional site code to filter incidents

    Returns:
        List of incident records with number, priority, description
    """
    query = build_servicenow_query(site_code)
    return servicenow_client.query("incident", query)

def main():
    mcp.run(transport="http", host="0.0.0.0", port=3001)

MCP Client Pattern

MUST connect to MCP servers using pydantic-ai integration:

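from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStreamableHTTP
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.azure import AzureProvider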

async def get_agent_with_mcp(mcp_server_url: str, system_prompt: str) -> Agent:
    """Create an agent connected to an MCP server."""

    # Get Azure OpenAI client via Stargate
    openai_client = await get_openai_client()
    model = OpenAIModel("gpt-4o", provider=AzureProvider(openai_client=openai_client))

    # Configure MCP server connection
    mcp_server = MCPServerStreamableHTTP(
        url=mcp_server_url,
        sse_read_timeout=300  # 5 minute timeout for long operations
    )

    return Agent(
        model=model,
        system_prompt=system_prompt,
        mcp_servers=[mcp_server]
    )

# Usage
async def main():
    agent = await get_agent_with_mcp(
        mcp_server_url="http://localhost:3001/mcp",
        system_prompt="You are a helpful incident analyst."
    )

    async with agent.run_mcp_servers():
        result = await agent.run("What incidents are open for site ABC123?")
        print(result.output)

Agent Configuration Pattern

Database-Driven Agent Definition

Wall-E stores agent configurations in PostgreSQL:

from django.db import models
from pydantic import BaseModel, Field

class Agent(models.Model):
    """Agent configuration stored in database."""

    name = models.CharField(max_length=20, unique=True)
    foundation_model = models.ForeignKey("FoundationModel", on_delete=models.DO_NOTHING)
    system_message = models.ForeignKey("SystemMessage", on_delete=models.DO_NOTHING)
    retries = models.IntegerField(default=2)
    output_schema = models.TextField()  # Pydantic schema as Python code
    pipe = models.TextField(default="data = output.advice")  # Output transformation
    tools = models.ManyToManyField("Tool", blank=True)

    async def initialize(self):
        """Initialize a WalleAgent from database config."""
        from chat.walle import WalleAgent

        tools = [await tool.execute() async for tool in self.tools.all()]

        agent = WalleAgent(
            result_type=await self.validate_output_schema(),
            system_prompt=await self.get_system_message(),
            retries=self.retries,
            tools=tools if tools else None,
        )

        await agent.initialize_model(agent.model)
        return agent

    async def validate_output_schema(self):
        """Dynamically validate and return Pydantic model from code.

        SECURITY WARNING: This method uses exec() to execute user-provided code,
        which presents a code injection risk. In production:
        1. ONLY execute schemas from trusted sources (database with access controls)
        2. NEVER accept schemas from user input or external APIs
        3. Consider using a declarative format (JSON/YAML) instead of Python code
        4. Implement schema validation/sanitization before exec()
        5. Run in a sandboxed environment with restricted permissions
        """
        namespace = {"BaseModel": BaseModel, "Field": Field}
        local_ns = {}

        # SECURITY: Only execute trusted schemas from controlled sources
        exec(self.output_schema, namespace, local_ns)

        # Find the Pydantic model in executed code
        # (named schema_models to avoid shadowing django.db.models)
        schema_models = {
            name: obj for name, obj in local_ns.items()
            if isinstance(obj, type) and issubclass(obj, BaseModel)
        }

        if len(schema_models) != 1:
            raise ValueError("Output schema MUST define exactly one Pydantic model")

        return next(iter(schema_models.values()))

WalleAgent Implementation

MUST implement agents with Stargate authentication and retry logic:

import os
import openai
import httpx
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.azure import AzureProvider

# Validate required environment variables at module load time
REQUIRED_ENV_VARS = [
    "TOKEN_URL",
    "SCOPE",
    "CLIENT_ID",
    "CLIENT_SECRET",
    "AZURE_ENDPOINT",
    "API_VERSION",
    "AZURE_DEPLOYMENT",
    "PROJECT_ID",
    "X_UPSTREAM_ENV",
]

missing_vars = [var for var in REQUIRED_ENV_VARS if not os.getenv(var)]
if missing_vars:
    raise EnvironmentError(
        f"Missing required environment variables: {', '.join(missing_vars)}. "
        f"Set these before starting the application."
    )


class WalleAgent:
    """Wall-E agent wrapper with Azure OpenAI integration."""

    def __init__(
        self,
        result_type: type = str,
        system_prompt: str = "",
        retries: int = 2,
        tools: list = None,
        deps_type: type = None,
    ):
        self.result_type = result_type
        self.system_prompt = system_prompt
        self.retries = retries
        self.tools = tools or []
        self.deps_type = deps_type
        self.model = "gpt-4o"
        self._agent = None
        self._openai_client = None

    async def initialize_model(self, model_name: str):
        """Initialize Azure OpenAI client via Stargate."""
        access_token = await self._get_stargate_token()

        self._openai_client = openai.AsyncAzureOpenAI(
            azure_endpoint=os.getenv("AZURE_ENDPOINT"),
            api_version=os.getenv("API_VERSION"),
            azure_deployment=os.getenv("AZURE_DEPLOYMENT"),
            azure_ad_token=access_token,
            default_headers={
                "projectId": os.getenv("PROJECT_ID"),
                "x-upstream-env": os.getenv("X_UPSTREAM_ENV"),
            },
        )

        model = OpenAIModel(
            model_name,
            provider=AzureProvider(openai_client=self._openai_client)
        )

        self._agent = Agent(
            model=model,
            result_type=self.result_type,
            system_prompt=self.system_prompt,
            retries=self.retries,
            tools=self.tools,
            deps_type=self.deps_type,
        )

    async def run(self, prompt: str, history: list = None, deps=None):
        """Run agent with automatic token refresh on auth failures."""
        try:
            return await self._agent.run(
                prompt,
                message_history=history,
                deps=deps
            )
        except openai.AuthenticationError:
            # Re-authenticate and retry
            await self.initialize_model(self.model)
            return await self._agent.run(
                prompt,
                message_history=history,
                deps=deps
            )

    async def _get_stargate_token(self) -> str:
        """Get OAuth2 token from Stargate.

        Environment variables are validated at module load time,
        so this method can safely access them without additional checks.
        """
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                os.getenv("TOKEN_URL"),
                headers={"Content-Type": "application/x-www-form-urlencoded"},
                data={
                    "grant_type": "client_credentials",
                    "scope": os.getenv("SCOPE"),
                    "client_id": os.getenv("CLIENT_ID"),
                    "client_secret": os.getenv("CLIENT_SECRET"),
                },
                timeout=60
            )
            resp.raise_for_status()
            return resp.json()["access_token"]

MCP Integration Patterns

Resource vs Tool Separation

MUST separate context (resources) from actions (tools):

# Resources = Read-only context
resources:
  - name: incident-context
    uri: servicenow://incident/{id}
    use_for: 'providing incident details to agents'
    updates: on-demand

  - name: metrics-context
    uri: victoria://metrics/{query}
    use_for: 'providing metric data for analysis'
    updates: real-time

# Tools = Actions with side effects
tools:
  - name: add-comment
    type: action
    side_effects: true
    risk: low-risk-write

  - name: execute-command
    type: action
    side_effects: true
    risk: high-risk-write
    approval_required: true

Tool Risk Mapping

MUST map tools to Wall-E policy tiers:

policy_tiers:
  tier_1_allow:
    description: 'Auto-approved, logged'
    tools:
      - get-incident-details
      - get-metrics
      - list-pods
      - describe-deployment

  tier_2_allow_with_audit:
    description: 'Auto-approved, enhanced audit'
    tools:
      - add-incident-comment
      - create-change-request
      - tag-resource

  tier_3_require_approval:
    description: 'Human approval required'
    tools:
      - update-incident-priority
      - close-incident
      - scale-deployment

  tier_4_require_dual_approval:
    description: 'Two approvers required'
    tools:
      - execute-ssh-command
      - modify-infrastructure
      - delete-resource
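
At runtime the tier mapping reduces to a lookup from tool name to a policy decision. The sketch below assumes a `Decision` enum and a `decide` function that are not part of Wall-E. It also assumes that a tool missing from every tier is denied, which the policy above does not state.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "tier_1_allow"
    ALLOW_WITH_AUDIT = "tier_2_allow_with_audit"
    REQUIRE_APPROVAL = "tier_3_require_approval"
    REQUIRE_DUAL_APPROVAL = "tier_4_require_dual_approval"
    DENY = "unmapped"  # assumption: unknown tools fail closed


# Tool lists copied from the policy_tiers definition
POLICY_TIERS = {
    Decision.ALLOW: {
        "get-incident-details", "get-metrics", "list-pods", "describe-deployment",
    },
    Decision.ALLOW_WITH_AUDIT: {
        "add-incident-comment", "create-change-request", "tag-resource",
    },
    Decision.REQUIRE_APPROVAL: {
        "update-incident-priority", "close-incident", "scale-deployment",
    },
    Decision.REQUIRE_DUAL_APPROVAL: {
        "execute-ssh-command", "modify-infrastructure", "delete-resource",
    },
}


def decide(tool_name: str) -> Decision:
    """Return the policy decision for a tool, denying anything unmapped."""
    for decision, tools in POLICY_TIERS.items():
        if tool_name in tools:
            return decision
    return Decision.DENY


print(decide("scale-deployment").name)
```

Failing closed means a newly added MCP tool stays unusable until someone assigns it a tier on purpose.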

Human-in-the-Loop Patterns

Approval Gate Pattern

MUST implement approval gates for high-risk operations:

approval_gate:
  name: remediation-approval
  trigger: before-execute-phase

  request:
    format: structured
    fields:
      - action-summary
      - risk-assessment
      - affected-systems
      - rollback-plan
      - estimated-duration

  approval_options:
    - approve
    - approve-with-modifications
    - reject
    - escalate

  timeout:
    duration: 15m
    action: escalate-to-secondary

  audit:
    log_request: true
    log_decision: true
    log_approver: true
    log_reasoning: true
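
The timeout behaviour of the gate can be expressed as a small pure function. This is a sketch, assuming a hypothetical `ApprovalRequest` record and `resolve` helper; it covers only the decision and timeout rules, not notification or audit logging.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

APPROVAL_OPTIONS = {"approve", "approve-with-modifications", "reject", "escalate"}
TIMEOUT = timedelta(minutes=15)  # timeout.duration from the gate definition


@dataclass(frozen=True)
class ApprovalRequest:
    action_summary: str
    requested_at: datetime


def resolve(request: ApprovalRequest, decision: str | None, now: datetime) -> str:
    """Return the gate outcome given the approver's decision, if any."""
    if decision is None:
        # No answer yet: escalate once the timeout has elapsed
        if now - request.requested_at >= TIMEOUT:
            return "escalate-to-secondary"
        return "pending"
    if decision not in APPROVAL_OPTIONS:
        raise ValueError(f"Unknown approval decision: {decision!r}")
    return decision


request = ApprovalRequest(
    action_summary="Restart payment-api pods",
    requested_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(resolve(request, None, request.requested_at + timedelta(minutes=20)))
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the timeout rule easy to test.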

Uncertainty Surfacing Pattern

MUST surface uncertainty explicitly:

uncertainty_handling:
  confidence_thresholds:
    high: 0.85
    medium: 0.65
    low: 0.45

  actions_by_confidence:
    high:
      - proceed-with-recommendation
      - log-confidence-score

    medium:
      - present-alternatives
      - request-human-validation
      - provide-supporting-evidence

    low:
      - halt-workflow
      - escalate-to-human
      - request-additional-context

  uncertainty_signals:
    - conflicting-evidence
    - missing-context
    - novel-scenario
    - ambiguous-request
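
Mapping a confidence score to its action list might look like the sketch below. The `classify` and `actions_for` names are hypothetical. The sketch assumes each threshold is an inclusive lower bound and that any score below the medium threshold, including scores under 0.45, is handled as low confidence.

```python
# Lower bounds copied from confidence_thresholds, highest first
THRESHOLDS = (("high", 0.85), ("medium", 0.65))

ACTIONS_BY_CONFIDENCE = {
    "high": ["proceed-with-recommendation", "log-confidence-score"],
    "medium": ["present-alternatives", "request-human-validation",
               "provide-supporting-evidence"],
    "low": ["halt-workflow", "escalate-to-human", "request-additional-context"],
}


def classify(score: float) -> str:
    """Return 'high', 'medium', or 'low' for a score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Confidence score out of range: {score}")
    for band, lower_bound in THRESHOLDS:
        if score >= lower_bound:
            return band
    # Assumption: everything below the medium bound is treated as low
    return "low"


def actions_for(score: float) -> list:
    """Look up the actions configured for the score's confidence band."""
    return ACTIONS_BY_CONFIDENCE[classify(score)]


print(classify(0.7), actions_for(0.7))
```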

Safety Guardrails

Enterprise Workflow Bypass Prevention

NEVER allow agents to bypass enterprise change workflows:

# Blocked patterns
blocked_operations:
  terraform:
    - terraform apply # Must go through TFE + CI/CD
    - terraform destroy # Must go through TFE + CI/CD

  kubernetes:
    - kubectl delete namespace # Must go through GitOps
    - kubectl apply -f # Must go through GitOps (prefer manifests)

  itsm:
    - close-incident-without-resolution # Must have resolution
    - skip-change-approval # Must follow CAB process
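
For the command-line entries, a first line of defence is a prefix check on the tokenised command. The sketch below is illustrative: `is_blocked` is a hypothetical helper, and prefix matching is easy to evade (for example with `sudo` or flags placed before the subcommand), so it complements enforcement at the MCP tool layer and does not replace it.

```python
import shlex

# Token prefixes derived from the terraform and kubernetes entries above
BLOCKED_COMMAND_PREFIXES = [
    ("terraform", "apply"),
    ("terraform", "destroy"),
    ("kubectl", "delete", "namespace"),
    ("kubectl", "apply", "-f"),
]


def is_blocked(command: str) -> bool:
    """True when the command starts with a blocked token sequence."""
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:
        # Unbalanced quotes: refuse commands that cannot be parsed
        return True
    return any(
        tokens[: len(prefix)] == prefix for prefix in BLOCKED_COMMAND_PREFIXES
    )


for candidate in ("terraform plan", "terraform apply -auto-approve"):
    print(candidate, "->", "blocked" if is_blocked(candidate) else "allowed")
```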

Production Safety Rules

MUST enforce production safety:

production_rules:
  read_only_by_default: true

  allowed_read_operations:
    - get-logs
    - get-metrics
    - describe-resources
    - list-pods
    - get-incident-details

  gated_write_operations:
    - restart-pod # Requires approval
    - scale-deployment # Requires approval
    - update-configmap # Requires approval + GitOps

  blocked_operations:
    - delete-pvc # Data loss risk
    - delete-namespace # Blast radius too high
    - direct-db-writes # Must use application APIs

Blast Radius Assessment

MUST assess blast radius before actions:

blast_radius_assessment:
  criteria:
    - affected_users_count
    - affected_services_count
    - data_at_risk
    - revenue_impact
    - clinical_impact

  thresholds:
    low:
      max_affected_users: 100
      max_affected_services: 2
      clinical_impact: none

    medium:
      max_affected_users: 1000
      max_affected_services: 5
      clinical_impact: none

    high:
      max_affected_users: 10000
      max_affected_services: 10
      clinical_impact: possible

    critical:
      clinical_impact: any

  actions_by_blast_radius:
    low: proceed-with-logging
    medium: require-single-approval
    high: require-dual-approval
    critical: require-leadership-approval
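
The thresholds can be applied as a cascade that picks the lowest level whose limits are satisfied. This sketch rests on several assumptions that the configuration does not spell out: the `Impact` record and `classify` function are hypothetical, the clinical impact vocabulary is taken to be "none", "possible", or "confirmed", a "possible" clinical impact is rated at least high, and anything beyond the high limits or with confirmed clinical impact is rated critical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    affected_users: int
    affected_services: int
    clinical_impact: str = "none"  # assumed values: none, possible, confirmed


# (level, max_affected_users, max_affected_services) from the thresholds
LIMITS = [("low", 100, 2), ("medium", 1000, 5), ("high", 10000, 10)]

ACTIONS_BY_BLAST_RADIUS = {
    "low": "proceed-with-logging",
    "medium": "require-single-approval",
    "high": "require-dual-approval",
    "critical": "require-leadership-approval",
}


def classify(impact: Impact) -> str:
    """Return the lowest blast-radius level whose limits the impact fits."""
    if impact.clinical_impact not in ("none", "possible"):
        return "critical"
    for level, max_users, max_services in LIMITS:
        fits = (impact.affected_users <= max_users
                and impact.affected_services <= max_services)
        if not fits:
            continue
        # Only the high level tolerates a possible clinical impact
        if impact.clinical_impact == "possible" and level != "high":
            continue
        return level
    return "critical"


impact = Impact(affected_users=500, affected_services=3)
print(classify(impact), ACTIONS_BY_BLAST_RADIUS[classify(impact)])
```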

RAG Integration Patterns

Citation Verification

MUST verify RAG citations before execution:

rag_verification:
  required_for:
    - runbook-execution
    - configuration-changes
    - remediation-steps

  verification_steps:
    - check-source-exists
    - check-source-current # Last updated < 90 days
    - check-source-approved # In approved knowledge base
    - cross-reference-actions # Match against known procedures

  on_verification_failure:
    - flag-unverified-citation
    - request-human-validation
    - suggest-alternative-sources
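
The first three verification steps lend themselves to a simple checklist function. The sketch below is a minimal illustration with a hypothetical `Citation` record and `failed_checks` helper; it assumes the existence and approval flags were resolved upstream and leaves out `cross-reference-actions`, which needs the procedure catalogue.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # "Last updated < 90 days"


@dataclass(frozen=True)
class Citation:
    uri: str
    exists: bool
    last_updated: date
    approved: bool


def failed_checks(citation: Citation, today: date) -> list:
    """Return the names of the verification steps the citation fails."""
    failures = []
    if not citation.exists:
        failures.append("check-source-exists")
    if today - citation.last_updated >= MAX_AGE:
        failures.append("check-source-current")
    if not citation.approved:
        failures.append("check-source-approved")
    return failures


stale = Citation(
    uri="https://runbooks.internal/db/failover",
    exists=True,
    last_updated=date(2024, 1, 1),
    approved=True,
)
print(failed_checks(stale, today=date(2024, 6, 1)))
```

An empty list means the citation may be used for execution; any failure would route to the `on_verification_failure` actions.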

Knowledge Base Boundaries

MUST constrain RAG to approved sources:

knowledge_sources:
  approved:
    - source: runbook-repository
      uri: 'https://runbooks.internal/*'
      trust_level: high

    - source: confluence-ops
      uri: 'https://confluence.internal/ops/*'
      trust_level: medium

    - source: servicenow-kb
      uri: 'servicenow://kb/*'
      trust_level: medium

  excluded:
    - external-forums
    - unverified-wikis
    - personal-notes
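
Since the approved sources are declared as URI patterns, an allow-list check can reuse shell-style matching from the standard library. This is a sketch: `trust_level` is a hypothetical helper, and it assumes the `*` in each pattern may match any remaining characters, including further path segments.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# (source, uri pattern, trust_level) copied from knowledge_sources.approved
APPROVED_SOURCES = [
    ("runbook-repository", "https://runbooks.internal/*", "high"),
    ("confluence-ops", "https://confluence.internal/ops/*", "medium"),
    ("servicenow-kb", "servicenow://kb/*", "medium"),
]


def trust_level(uri: str) -> str | None:
    """Return the trust level of the first matching source, else None."""
    for _source, pattern, level in APPROVED_SOURCES:
        # fnmatchcase avoids the OS-dependent case folding of fnmatch
        if fnmatchcase(uri, pattern):
            return level
    return None


for uri in ("https://runbooks.internal/db/failover", "https://forum.example/q/1"):
    print(uri, "->", trust_level(uri))
```

Returning `None` for anything unmatched treats every unlisted source as excluded, which matches the allow-list intent of the configuration.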

Observability Requirements

Correlation ID Propagation

MUST propagate correlation IDs across agents:

tracing:
  correlation_id:
    header: X-Correlation-ID
    format: uuid-v4
    propagate_to:
      - all-agent-calls
      - all-mcp-tool-calls
      - all-external-apis
      - all-log-entries

  span_creation:
    - workflow-start
    - phase-transition
    - agent-handoff
    - tool-invocation
    - approval-request
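
Within a single Python service, propagation can be built on `contextvars` so the ID follows each async task without being passed by hand. The sketch below uses only the standard library; the helper names (`start_workflow`, `outbound_headers`, `CorrelationFilter`) are hypothetical, and the header lookup assumes the incoming key uses exactly the casing shown in the configuration.

```python
from __future__ import annotations

import contextvars
import logging
import uuid

HEADER = "X-Correlation-ID"
correlation_id: contextvars.ContextVar = contextvars.ContextVar(
    "correlation_id", default=None
)


def start_workflow(incoming_headers: dict | None = None) -> str:
    """Reuse the caller's correlation ID or mint a new UUID v4."""
    cid = (incoming_headers or {}).get(HEADER) or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def outbound_headers(extra: dict | None = None) -> dict:
    """Headers for agent, MCP tool, and external API calls."""
    headers = dict(extra or {})
    cid = correlation_id.get()
    if cid:
        headers[HEADER] = cid
    return headers


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get() or "-"
        return True


handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
log = logging.getLogger("walle.tracing.example")
log.addHandler(handler)
log.setLevel(logging.INFO)

start_workflow()
log.info("workflow-start")
```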

Audit Logging

MUST log all significant events:

audit_events:
  - event: workflow-started
    fields: [workflow_id, user_id, trigger_source]

  - event: agent-invoked
    fields: [agent_name, input_summary, correlation_id]

  - event: tool-called
    fields: [tool_name, mcp_server, risk_level, input_params]

  - event: approval-requested
    fields: [action, approver, risk_assessment]

  - event: approval-decision
    fields: [action, approver, decision, reasoning]

  - event: workflow-completed
    fields: [workflow_id, outcome, duration, actions_taken]
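
Treating the event list as a schema lets the emitter reject incomplete audit records before they are written. This is a sketch with a hypothetical `build_audit_event` helper; the JSON layout and the added `timestamp` field are assumptions, not Wall-E's audit format.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone

# Required fields per event, copied from audit_events
AUDIT_SCHEMA = {
    "workflow-started": ["workflow_id", "user_id", "trigger_source"],
    "agent-invoked": ["agent_name", "input_summary", "correlation_id"],
    "tool-called": ["tool_name", "mcp_server", "risk_level", "input_params"],
    "approval-requested": ["action", "approver", "risk_assessment"],
    "approval-decision": ["action", "approver", "decision", "reasoning"],
    "workflow-completed": ["workflow_id", "outcome", "duration", "actions_taken"],
}


def build_audit_event(event: str, timestamp: datetime | None = None, **fields) -> str:
    """Serialise an audit event to JSON, enforcing its required fields."""
    required = AUDIT_SCHEMA.get(event)
    if required is None:
        raise ValueError(f"Unknown audit event: {event!r}")
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"{event} is missing fields: {', '.join(missing)}")
    moment = timestamp or datetime.now(timezone.utc)
    record = {"event": event, "timestamp": moment.isoformat(), **fields}
    return json.dumps(record, sort_keys=True)


print(build_audit_event(
    "tool-called",
    tool_name="get-metrics",
    mcp_server="victoria-metrics",
    risk_level="read-only",
    input_params={"query": "up"},
))
```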

Configuration Reference

Wall-E configuration hierarchy:

# Project → Access Group → Agent → MCP → System Message → Deployment

project:
  name: platform-sre
  billing_code: PLAT-001
  owner: sre-leadership

access_groups:
  - name: sre-engineers
    permissions:
      - view-all-agents
      - use-read-only-tools
      - request-write-approvals

  - name: sre-leads
    permissions:
      - all-engineer-permissions
      - approve-tier-3-operations
      - configure-agents

agents:
  - name: incident-triage-agent
    model: gpt-4o
    temperature: 0.3
    system_message_ref: incident-triage-prompt
    mcp_servers:
      - servicenow
      - victoria-metrics

mcp_servers:
  - name: servicenow
    endpoint: https://mcp-servicenow.internal
    auth: managed-identity

  - name: victoria-metrics
    endpoint: https://mcp-victoria.internal
    auth: managed-identity

deployments:
  - environment: production
    agents: [incident-triage-agent]
    approval_tier: tier-3