
Wall-E Orchestration Patterns (Optum)

Patterns and guardrails for composing safe multi-agent workflows in Wall-E (Wide Array Large Language Engine), Optum's enterprise AI orchestration platform.

Status: experimental
IDE: claude, codex, vscode
Version: 1.0.0
Owner: epic-platform-sre
Tags: wall-e, orchestration, multi-agent, safety, optum

Wall-E Orchestration Patterns

Overview

Wall-E (Wide Array Large Language Engine) is Optum's multi-agent orchestration platform. This guide covers patterns for composing safe, effective agent workflows.

Architecture Context

Wall-E consists of three layers:

┌─────────────────────────────────────────────────────────────┐
│                  CONSUMPTION METHODS                        │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐    │
│   │ Portal  │   │ Widget  │   │   API   │   │ Copilot │    │
│   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘    │
└────────┼─────────────┼─────────────┼─────────────┼──────────┘
         │             │             │             │
┌────────┴─────────────┴─────────────┴─────────────┴──────────┐
│                    BACKEND LAYER                            │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│   │ FastAPI  │  │  Redis   │  │  Celery  │  │ Postgres │   │
│   └──────────┘  └──────────┘  └──────────┘  └──────────┘   │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────┴───────────────────────────────────┐
│                    MCP SERVERS                              │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│   │ServiceNow│  │   Ignis  │  │    SSH   │  │ Victoria │   │
│   └──────────┘  └──────────┘  └──────────┘  │ Metrics  │   │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  └──────────┘   │
│   │ InfraRED │  │ NetScout │  │  Custom  │                 │
│   └──────────┘  └──────────┘  └──────────┘                 │
└─────────────────────────────────────────────────────────────┘

Agent Contract Pattern

MUST define an agent contract for every agent:

# agent-contract.yaml
agent:
  name: incident-triage-agent
  version: '1.0.0'
  owner: platform-sre-team

  # Domain boundaries
  domain:
    primary: 'incident-management'
    secondary: ['monitoring', 'alerting']

  # Risk classification
  risk:
    tier: 2 # 1=low, 2=medium, 3=high, 4=critical
    data_classification: internal
    phi_access: false
    pii_access: false

  # Available capabilities
  capabilities:
    tools:
      - name: get-incident-details
        mcp_server: servicenow
        risk_level: read-only
      - name: get-metrics
        mcp_server: victoria-metrics
        risk_level: read-only
      - name: add-incident-comment
        mcp_server: servicenow
        risk_level: low-risk-write
      - name: update-incident-priority
        mcp_server: servicenow
        risk_level: high-risk-write
        requires_approval: true

    resources:
      - uri_pattern: 'servicenow://incident/*'
        mcp_server: servicenow
      - uri_pattern: 'victoria://metrics/*'
        mcp_server: victoria-metrics

  # Approval requirements
  approvals:
    human_in_loop:
      - action: update-incident-priority
        condition: always
      - action: close-incident
        condition: always
      - action: execute-remediation
        condition: when_production

  # Escalation path
  escalation:
    primary: '#platform-sre-oncall'
    secondary: '[email protected]'
    pager: 'platform-sre-pd'
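
A contract is only useful if something checks it. The sketch below is one possible lint pass over a contract that has already been parsed from YAML into a dict. The function name `contract_violations` and the specific rules are assumptions made for illustration, not part of Wall-E.

```python
# Hypothetical helper: the name and rules are illustrative assumptions.
# The dict layout mirrors agent-contract.yaml after parsing.

def contract_violations(contract: dict) -> list[str]:
    """Return human-readable problems found in a parsed agent contract."""
    agent = contract.get("agent", {})
    problems: list[str] = []

    tier = agent.get("risk", {}).get("tier")
    if tier not in (1, 2, 3, 4):
        problems.append(f"risk.tier must be 1-4, got {tier!r}")

    # Actions that have a human-in-the-loop rule.
    gated = {
        rule["action"]
        for rule in agent.get("approvals", {}).get("human_in_loop", [])
    }

    for tool in agent.get("capabilities", {}).get("tools", []):
        if tool.get("risk_level") != "high-risk-write":
            continue
        name = tool.get("name")
        if not tool.get("requires_approval", False):
            problems.append(f"{name}: high-risk-write tool lacks requires_approval")
        if name not in gated:
            problems.append(f"{name}: no human_in_loop approval rule")

    return problems


contract = {
    "agent": {
        "name": "incident-triage-agent",
        "risk": {"tier": 2},
        "capabilities": {
            "tools": [
                {"name": "get-metrics", "risk_level": "read-only"},
                {
                    "name": "update-incident-priority",
                    "risk_level": "high-risk-write",
                    "requires_approval": True,
                },
            ]
        },
        "approvals": {
            "human_in_loop": [
                {"action": "update-incident-priority", "condition": "always"}
            ]
        },
    }
}

print(contract_violations(contract))  # prints []
```

Running a check like this in CI keeps a high-risk tool from shipping without its approval rule.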

Specialist Agent Pattern

PREFER specialist agents with narrow scopes over generalists:

# ❌ BAD: Generalist agent with broad scope
agent:
  name: everything-agent
  capabilities:
    - incident-management
    - change-management
    - infrastructure
    - security
    - compliance

# ✅ GOOD: Specialist agents with focused domains
agents:
  - name: incident-triage-agent
    domain: incident-management
    expertise:
      - symptom-analysis
      - impact-assessment
      - runbook-matching

  - name: metrics-analyst-agent
    domain: observability
    expertise:
      - metric-correlation
      - anomaly-detection
      - trend-analysis

  - name: remediation-advisor-agent
    domain: incident-remediation
    expertise:
      - runbook-execution
      - safe-command-generation
      - rollback-planning

Implementation Technology

Wall-E uses pydantic-graph for workflow orchestration and pydantic-ai for agent implementation with Azure OpenAI via Stargate authentication.

Core Dependencies

Version Compatibility:

  • pydantic-graph: ^0.3.0 (for BaseNode, Graph, Edge)
  • pydantic-ai: ^0.0.14 (for Agent, MCPServerStreamableHTTP)
  • pydantic: ^2.9.0 (for BaseModel, Field)
  • openai: ^1.54.0 (for Azure OpenAI client)

These packages are under active development. Pin to specific versions in production and test thoroughly before upgrading.

# Wall-E Core Stack
# Package versions (refer to Wall-E deployment requirements.txt):
#   pydantic-graph: ^0.3.0 (graph orchestration)
#   pydantic-ai: ^0.0.14 (agent framework)
#   pydantic: ^2.9.0 (data validation)
from pydantic_graph import BaseNode, GraphRunContext, End, Graph
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.azure import AzureProvider
from pydantic import BaseModel, Field
from dataclasses import dataclass, field

State Management Pattern

MUST use dataclass-based state with separate namespaces:

from dataclasses import dataclass, field
from typing import Any

from pydantic_ai.messages import ModelMessage

@dataclass
class WorkflowState:
    """State shared across all workflow nodes."""

    # User input namespace
    user: dict = field(default_factory=dict)
    # Agent output namespace
    agent: dict = field(default_factory=dict)
    # Temporary buffer namespace
    buffer: dict = field(default_factory=dict)
    # Message history for context
    message_history: list[ModelMessage] = field(default_factory=list)

    def set_user_input(self, key: str, value: Any) -> None:
        self.user[key] = value

    def get_agent_output(self, key: str) -> Any:
        return self.agent.get(key)

Node Implementation Pattern

MUST implement nodes using BaseNode with typed return annotations:

from dataclasses import dataclass
from typing import Annotated

from pydantic import ValidationError
from pydantic_ai import format_as_xml
from pydantic_graph import BaseNode, Edge, End, GraphRunContext

# FieldValidationError, analyze_agent, ExecutePlan, and RequestClarification
# are not shown here; they are assumed to be defined elsewhere in the project.

class BaseWorkflowNode(BaseNode[WorkflowState]):
    """Extended BaseNode with validation and docstrings."""

    docstring_notes = True  # Include docstrings in graph visualization
    validation_schema = None  # Optional Pydantic schema for validation

    async def run(self, ctx: GraphRunContext[WorkflowState]):
        raise NotImplementedError("Subclasses MUST implement run()")

    def validate(self, data) -> list:
        """Validate data against schema if defined."""
        if self.validation_schema:
            try:
                # Pydantic v2 API; BaseModel.validate() is deprecated
                self.validation_schema.model_validate(data)
                return []
            except ValidationError as e:
                return [FieldValidationError(msg=str(err["msg"]),
                                            input=str(err["input"]))
                        for err in e.errors()]
        return []

@dataclass
class AnalyzeRequest(BaseWorkflowNode):
    """Analyze the incoming request and classify intent."""

    # Each possible successor gets its own Annotated[...] entry. Writing
    # "A" | "B" on two string literals raises TypeError when the class is defined.
    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> (
        Annotated["ExecutePlan", Edge(label="classified")]
        | Annotated["RequestClarification", Edge(label="classified")]
    ):
        # Analyze request using agent
        result = await analyze_agent.run(
            format_as_xml({"request": ctx.state.user.get("request")})
        )

        if result.data.needs_clarification:
            return RequestClarification()

        ctx.state.agent["analysis"] = result.data
        return ExecutePlan()

Workflow Composition Patterns

Diagnose → Propose → Execute → Verify Pattern

MUST structure workflows in four phases:

workflow:
  name: incident-remediation
  phases:
    - name: diagnose
      agents: [incident-triage-agent, metrics-analyst-agent]
      actions:
        - gather-incident-context
        - analyze-metrics
        - identify-affected-systems
        - correlate-recent-changes
      outputs:
        - diagnosis-summary
        - affected-components
        - probable-causes

    - name: propose
      agents: [remediation-advisor-agent]
      inputs: [diagnosis-summary, affected-components]
      actions:
        - match-runbooks
        - generate-remediation-options
        - assess-risk-per-option
        - recommend-approach
      outputs:
        - remediation-plan
        - risk-assessment
        - rollback-plan
      gates:
        - type: human-approval
          approver: on-call-engineer
          timeout: 15m

    - name: execute
      agents: [remediation-executor-agent]
      inputs: [remediation-plan, approval-token]
      preconditions:
        - approval-received
        - rollback-plan-validated
      actions:
        - execute-remediation-steps
        - monitor-execution
        - capture-outputs
      outputs:
        - execution-results
        - step-by-step-log

    - name: verify
      agents: [verification-agent, metrics-analyst-agent]
      inputs: [execution-results, original-symptoms]
      actions:
        - verify-symptoms-resolved
        - confirm-metrics-normal
        - validate-no-side-effects
      outputs:
        - verification-report
        - incident-resolution-status
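
The phase ordering and the approval gate can be enforced in code rather than by convention. The following is a minimal sketch, assuming each phase records its outputs in a shared dict and refuses to start when an input is missing. `RemediationRun`, `GateError`, and the placeholder values are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


class GateError(RuntimeError):
    """Raised when a phase is entered without its required inputs."""


@dataclass
class RemediationRun:
    """Tracks which phases ran and which artifacts they produced."""

    outputs: dict = field(default_factory=dict)
    completed: list = field(default_factory=list)

    def _require(self, *names: str) -> None:
        missing = [n for n in names if n not in self.outputs]
        if missing:
            raise GateError(f"missing inputs: {', '.join(missing)}")

    def diagnose(self) -> None:
        self.outputs["diagnosis-summary"] = "placeholder diagnosis"
        self.completed.append("diagnose")

    def propose(self) -> None:
        self._require("diagnosis-summary")
        self.outputs["remediation-plan"] = "placeholder plan"
        self.outputs["rollback-plan"] = "placeholder rollback"
        self.completed.append("propose")

    def approve(self, approver: str) -> None:
        # Stand-in for the human-approval gate between propose and execute.
        self._require("remediation-plan")
        self.outputs["approval-token"] = f"approved-by:{approver}"

    def execute(self) -> None:
        self._require("remediation-plan", "rollback-plan", "approval-token")
        self.outputs["execution-results"] = "placeholder results"
        self.completed.append("execute")

    def verify(self) -> None:
        self._require("execution-results")
        self.outputs["verification-report"] = "placeholder report"
        self.completed.append("verify")


run = RemediationRun()
run.diagnose()
run.propose()
try:
    run.execute()  # no approval token yet, so the gate refuses
except GateError as exc:
    print(exc)  # prints: missing inputs: approval-token
run.approve("on-call-engineer")
run.execute()
run.verify()
print(run.completed)  # prints: ['diagnose', 'propose', 'execute', 'verify']
```

Modeling the approval as a required input to execute means a missing approval fails loudly instead of being skipped.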

Orchestrator Pattern (DocX)

Wall-E uses DocX as the primary orchestrator agent. Here is the actual implementation pattern:

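from dataclasses import dataclass
from typing import Annotated

from pydantic_ai import format_as_xml
from pydantic_graph import Edge, Graph, GraphRunContext

# AskQuestionSchema, evaluate_agent, generate_session_id, and the nodes not
# shown here (GenerateSQL, ChooseQuery, FetchData, ContextNode, EvalNode,
# RunAgain, Reprimand) are assumed to be defined elsewhere.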

@dataclass
class StartNode(BaseWorkflowNode):
    """Entry point - initialize workflow state."""

    async def run(self, ctx: GraphRunContext[WorkflowState]) -> "GreetingNode":
        ctx.state.user["session_id"] = generate_session_id()
        return GreetingNode()

@dataclass
class GreetingNode(BaseWorkflowNode):
    """Greet user and collect initial input."""

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> Annotated["AskQuestion", Edge(label="greeted")]:
        ctx.state.agent["greeting"] = "How can I help you today?"
        return AskQuestion()

@dataclass
class AskQuestion(BaseWorkflowNode):
    """Collect and validate user question."""
    validation_schema = AskQuestionSchema

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> (
        Annotated["Evaluate", Edge(label="question_received")]
        | Annotated["Reprimand", Edge(label="question_received")]
    ):
        question = ctx.state.user.get("question")
        errors = self.validate({"question": question})

        if errors:
            ctx.state.agent["error"] = errors
            return Reprimand()

        return Evaluate()

@dataclass
class Evaluate(BaseWorkflowNode):
    """Evaluate if question is valid and actionable."""

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> (
        Annotated["GenerateSQL", Edge(label="Good Question")]
        | Annotated["Reprimand", Edge(label="Bad Question")]
    ):
        result = await evaluate_agent.run(
            format_as_xml({"question": ctx.state.user.get("question")})
        )

        if result.data.correct:
            return GenerateSQL()
        else:
            ctx.state.agent["reprimand_reason"] = result.data.reason
            return Reprimand()

# Build the graph
docx_graph = Graph(
    nodes=[StartNode, GreetingNode, AskQuestion, Evaluate,
           GenerateSQL, ChooseQuery, FetchData, ContextNode,
           EvalNode, RunAgain, Reprimand],
    state_type=WorkflowState
)

Graph Execution Pattern

MUST execute graphs with proper error handling:

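import logging

logger = logging.getLogger(__name__)

async def run_workflow(user_input: str, history: list[ModelMessage] | None = None):
    """Execute a Wall-E workflow with state management."""

    # Initialize state
    state = WorkflowState()
    state.user["request"] = user_input
    if history:
        state.message_history = history

    # Run graph; WorkflowDeps is the project's dependency container (not shown)
    try:
        async with docx_graph.iter(
            StartNode(),
            state=state,
            deps=WorkflowDeps()
        ) as graph_run:
            previous = None
            # The run yields one node per step and stops after the End marker
            async for node in graph_run:
                current = type(node).__name__
                # Log node transitions for observability
                if previous is not None:
                    logger.info("Transition: %s -> %s", previous, current)
                previous = current
    except Exception:
        # Record the failure with its traceback, then let the caller decide
        logger.exception("Workflow failed for request %r", user_input)
        raise

    return state.agent.get("result")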

MCP Integration Pattern

MUST proxy MCP requests through authenticated endpoints:

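from fastapi import APIRouter, HTTPException, Request, Response
from httpx import AsyncClient

router = APIRouter()

# httpx decodes the upstream body, so the upstream length and encoding headers
# no longer describe the content being returned and must not be copied.
EXCLUDED_RESPONSE_HEADERS = {
    "content-length", "content-encoding", "transfer-encoding", "connection",
}

@router.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE"])
async def proxy_mcp_request(request: Request, path: str):
    """Proxy requests to MCP servers with authentication."""

    # Get auth token from request; reject unauthenticated calls
    auth_header = request.headers.get("Authorization")
    if not auth_header:
        raise HTTPException(status_code=401, detail="Missing Authorization header")

    # Build target URL (MCP_SERVER_BASE_URL comes from deployment configuration)
    target_url = f"{MCP_SERVER_BASE_URL}/{path}"

    # Forward request with auth, preserving the body's content type
    forward_headers = {"Authorization": auth_header}
    if "content-type" in request.headers:
        forward_headers["Content-Type"] = request.headers["content-type"]

    async with AsyncClient() as client:
        response = await client.request(
            method=request.method,
            url=target_url,
            headers=forward_headers,
            content=await request.body(),
            params=request.query_params.multi_items(),
        )

    return Response(
        content=response.content,
        status_code=response.status_code,
        headers={
            name: value
            for name, value in response.headers.items()
            if name.lower() not in EXCLUDED_RESPONSE_HEADERS
        },
    )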

MCP Server Implementation Pattern

FastMCP Server Pattern

MUST implement MCP servers using FastMCP:

from fastmcp import FastMCP

instructions = """
This MCP server provides tools for querying ServiceNow incidents.
Available tools:
- fetch_incidents: Retrieve incidents by site code
"""

mcp = FastMCP(
    name="ServiceNow MCP Server",
    version="1.0.0",
    instructions=instructions,
)

@mcp.tool()
def fetch_incidents(site_code: str | None = None) -> list[dict]:
    """
    Fetch active incidents from ServiceNow.

    Args:
        site_code: Optional site code to filter incidents

    Returns:
        List of incident records with number, priority, description
    """
    query = build_servicenow_query(site_code)
    return servicenow_client.query("incident", query)

def main():
    mcp.run(transport="http", host="0.0.0.0", port=3001)

MCP Client Pattern

MUST connect to MCP servers using pydantic-ai integration:

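from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStreamableHTTP
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.azure import AzureProvider

# get_openai_client() is the project's Stargate-authenticated client factory (not shown).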

async def get_agent_with_mcp(mcp_server_url: str, system_prompt: str) -> Agent:
    """Create an agent connected to an MCP server."""

    # Get Azure OpenAI client via Stargate
    openai_client = await get_openai_client()
    model = OpenAIModel("gpt-4o", provider=AzureProvider(openai_client=openai_client))

    # Configure MCP server connection
    mcp_server = MCPServerStreamableHTTP(
        url=mcp_server_url,
        sse_read_timeout=300  # 5 minute timeout for long operations
    )

    return Agent(
        model=model,
        system_prompt=system_prompt,
        mcp_servers=[mcp_server]
    )

# Usage
async def main():
    agent = await get_agent_with_mcp(
        mcp_server_url="http://localhost:3001/mcp",
        system_prompt="You are a helpful incident analyst."
    )

    async with agent.run_mcp_servers():
        result = await agent.run("What incidents are open for site ABC123?")
        print(result.output)

Agent Configuration Pattern

Database-Driven Agent Definition

Wall-E stores agent configurations in PostgreSQL:

from django.db import models
from pydantic import BaseModel, Field

class Agent(models.Model):
    """Agent configuration stored in database."""

    name = models.CharField(max_length=20, unique=True)
    foundation_model = models.ForeignKey("FoundationModel", on_delete=models.DO_NOTHING)
    system_message = models.ForeignKey("SystemMessage", on_delete=models.DO_NOTHING)
    retries = models.IntegerField(default=2)
    output_schema = models.TextField()  # Pydantic schema as Python code
    pipe = models.TextField(default="data = output.advice")  # Output transformation
    tools = models.ManyToManyField("Tool", blank=True)

    async def initialize(self):
        """Initialize a WalleAgent from database config."""
        from chat.walle import WalleAgent

        tools = [await tool.execute() async for tool in self.tools.all()]

        agent = WalleAgent(
            result_type=await self.validate_output_schema(),
            system_prompt=await self.get_system_message(),
            retries=self.retries,
            tools=tools if tools else None,
        )

        await agent.initialize_model(agent.model)
        return agent

    async def validate_output_schema(self):
        """Dynamically validate and return Pydantic model from code.

        SECURITY WARNING: This method uses exec() to execute user-provided code,
        which presents a code injection risk. In production:
        1. ONLY execute schemas from trusted sources (database with access controls)
        2. NEVER accept schemas from user input or external APIs
        3. Consider using a declarative format (JSON/YAML) instead of Python code
        4. Implement schema validation/sanitization before exec()
        5. Run in a sandboxed environment with restricted permissions
        """
        namespace = {"BaseModel": BaseModel, "Field": Field}
        local_ns = {}

        # SECURITY: Only execute trusted schemas from controlled sources
        exec(self.output_schema, namespace, local_ns)

        # Find the Pydantic model in executed code
        # Named schema_models to avoid shadowing the django.db `models` import
        schema_models = {
            name: obj for name, obj in local_ns.items()
            if isinstance(obj, type) and issubclass(obj, BaseModel)
        }

        if len(schema_models) != 1:
            raise ValueError("Output schema MUST define exactly one Pydantic model")

        return next(iter(schema_models.values()))

WalleAgent Implementation

MUST implement agents with Stargate authentication and retry logic:

import os
import openai
import httpx
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.azure import AzureProvider

# Validate required environment variables at module load time
REQUIRED_ENV_VARS = [
    "TOKEN_URL",
    "SCOPE",
    "CLIENT_ID",
    "CLIENT_SECRET",
    "AZURE_ENDPOINT",
    "API_VERSION",
    "AZURE_DEPLOYMENT",
    "PROJECT_ID",
    "X_UPSTREAM_ENV",
]

missing_vars = [var for var in REQUIRED_ENV_VARS if not os.getenv(var)]
if missing_vars:
    raise EnvironmentError(
        f"Missing required environment variables: {', '.join(missing_vars)}. "
        f"Set these before starting the application."
    )


class WalleAgent:
    """Wall-E agent wrapper with Azure OpenAI integration."""

    def __init__(
        self,
        result_type: type = str,
        system_prompt: str = "",
        retries: int = 2,
        tools: list = None,
        deps_type: type = None,
    ):
        self.result_type = result_type
        self.system_prompt = system_prompt
        self.retries = retries
        self.tools = tools or []
        self.deps_type = deps_type
        self.model = "gpt-4o"
        self._agent = None
        self._openai_client = None

    async def initialize_model(self, model_name: str):
        """Initialize Azure OpenAI client via Stargate."""
        # Remember the model so re-authentication in run() rebuilds the same one
        self.model = model_name
        access_token = await self._get_stargate_token()

        self._openai_client = openai.AsyncAzureOpenAI(
            azure_endpoint=os.getenv("AZURE_ENDPOINT"),
            api_version=os.getenv("API_VERSION"),
            azure_deployment=os.getenv("AZURE_DEPLOYMENT"),
            azure_ad_token=access_token,
            default_headers={
                "projectId": os.getenv("PROJECT_ID"),
                "x-upstream-env": os.getenv("X_UPSTREAM_ENV"),
            },
        )

        model = OpenAIModel(
            model_name,
            provider=AzureProvider(openai_client=self._openai_client)
        )

        self._agent = Agent(
            model=model,
            result_type=self.result_type,
            system_prompt=self.system_prompt,
            retries=self.retries,
            tools=self.tools,
            deps_type=self.deps_type,
        )

    async def run(self, prompt: str, history: list = None, deps=None):
        """Run agent with automatic token refresh on auth failures."""
        try:
            return await self._agent.run(
                prompt,
                message_history=history,
                deps=deps
            )
        except openai.AuthenticationError:
            # Re-authenticate and retry
            await self.initialize_model(self.model)
            return await self._agent.run(
                prompt,
                message_history=history,
                deps=deps
            )

    async def _get_stargate_token(self) -> str:
        """Get OAuth2 token from Stargate.

        Environment variables are validated at module load time,
        so this method can safely access them without additional checks.
        """
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                os.getenv("TOKEN_URL"),
                headers={"Content-Type": "application/x-www-form-urlencoded"},
                data={
                    "grant_type": "client_credentials",
                    "scope": os.getenv("SCOPE"),
                    "client_id": os.getenv("CLIENT_ID"),
                    "client_secret": os.getenv("CLIENT_SECRET"),
                },
                timeout=60
            )
            resp.raise_for_status()
            return resp.json()["access_token"]

MCP Integration Patterns

Resource vs Tool Separation

MUST separate context (resources) from actions (tools):

# Resources = Read-only context
resources:
  - name: incident-context
    uri: servicenow://incident/{id}
    use_for: 'providing incident details to agents'
    updates: on-demand

  - name: metrics-context
    uri: victoria://metrics/{query}
    use_for: 'providing metric data for analysis'
    updates: real-time

# Tools = Actions with side effects
tools:
  - name: add-comment
    type: action
    side_effects: true
    risk: low-risk-write

  - name: execute-command
    type: action
    side_effects: true
    risk: high-risk-write
    approval_required: true

Tool Risk Mapping

MUST map tools to Wall-E policy tiers:

policy_tiers:
  tier_1_allow:
    description: 'Auto-approved, logged'
    tools:
      - get-incident-details
      - get-metrics
      - list-pods
      - describe-deployment

  tier_2_allow_with_audit:
    description: 'Auto-approved, enhanced audit'
    tools:
      - add-incident-comment
      - create-change-request
      - tag-resource

  tier_3_require_approval:
    description: 'Human approval required'
    tools:
      - update-incident-priority
      - close-incident
      - scale-deployment

  tier_4_require_dual_approval:
    description: 'Two approvers required'
    tools:
      - execute-ssh-command
      - modify-infrastructure
      - delete-resource
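
A tier mapping like this is typically consulted before every tool call. The sketch below shows one way to do that lookup. The `Decision` enum and the `decide` function are hypothetical names, and denying unmapped tools by default is an assumption that the policy above does not state.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "tier_1_allow"
    ALLOW_WITH_AUDIT = "tier_2_allow_with_audit"
    REQUIRE_APPROVAL = "tier_3_require_approval"
    REQUIRE_DUAL_APPROVAL = "tier_4_require_dual_approval"
    DENY = "deny"  # assumption: not a Wall-E tier, used for unmapped tools


# Tool names copied from the policy_tiers mapping above.
POLICY_TIERS = {
    Decision.ALLOW: {
        "get-incident-details", "get-metrics", "list-pods", "describe-deployment",
    },
    Decision.ALLOW_WITH_AUDIT: {
        "add-incident-comment", "create-change-request", "tag-resource",
    },
    Decision.REQUIRE_APPROVAL: {
        "update-incident-priority", "close-incident", "scale-deployment",
    },
    Decision.REQUIRE_DUAL_APPROVAL: {
        "execute-ssh-command", "modify-infrastructure", "delete-resource",
    },
}


def decide(tool_name: str) -> Decision:
    """Return the policy decision for a tool, denying anything unmapped."""
    for decision, tools in POLICY_TIERS.items():
        if tool_name in tools:
            return decision
    return Decision.DENY


print(decide("get-metrics").value)          # prints: tier_1_allow
print(decide("execute-ssh-command").value)  # prints: tier_4_require_dual_approval
print(decide("format-disk").value)          # prints: deny
```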

Human-in-the-Loop Patterns

Approval Gate Pattern

MUST implement approval gates for high-risk operations:

approval_gate:
  name: remediation-approval
  trigger: before-execute-phase

  request:
    format: structured
    fields:
      - action-summary
      - risk-assessment
      - affected-systems
      - rollback-plan
      - estimated-duration

  approval_options:
    - approve
    - approve-with-modifications
    - reject
    - escalate

  timeout:
    duration: 15m
    action: escalate-to-secondary

  audit:
    log_request: true
    log_decision: true
    log_approver: true
    log_reasoning: true
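
The timeout behavior is the part of a gate most often left unimplemented. The sketch below assumes the approver's answer arrives on an `asyncio.Future` and that a missed deadline is treated as an escalation. `GateResult` and `await_approval` are hypothetical names, and the demo uses a short timeout in place of 15 minutes.

```python
import asyncio
from dataclasses import dataclass

VALID_DECISIONS = {"approve", "approve-with-modifications", "reject", "escalate"}


@dataclass
class GateResult:
    decision: str
    approver: str
    reasoning: str


async def await_approval(pending: "asyncio.Future[GateResult]", timeout_s: float) -> GateResult:
    """Wait for a human decision; escalate if none arrives before the timeout."""
    try:
        result = await asyncio.wait_for(pending, timeout=timeout_s)
    except asyncio.TimeoutError:
        # Mirrors `timeout.action: escalate-to-secondary` in the gate definition.
        return GateResult("escalate", "system", "no decision before timeout")
    if result.decision not in VALID_DECISIONS:
        raise ValueError(f"unknown decision: {result.decision!r}")
    return result


async def demo() -> tuple[GateResult, GateResult]:
    loop = asyncio.get_running_loop()

    answered = loop.create_future()
    answered.set_result(
        GateResult("approve", "on-call-engineer", "rollback plan reviewed")
    )
    unanswered = loop.create_future()  # nobody ever responds to this one

    first = await await_approval(answered, timeout_s=0.05)
    second = await await_approval(unanswered, timeout_s=0.05)
    return first, second


if __name__ == "__main__":
    for outcome in asyncio.run(demo()):
        print(outcome.decision, "-", outcome.reasoning)
```

Both outcomes carry an approver and a reasoning string, so the audit fields listed above can be logged the same way for human and timeout decisions.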

Uncertainty Surfacing Pattern

MUST surface uncertainty explicitly:

uncertainty_handling:
  confidence_thresholds:
    high: 0.85
    medium: 0.65
    low: 0.45

  actions_by_confidence:
    high:
      - proceed-with-recommendation
      - log-confidence-score

    medium:
      - present-alternatives
      - request-human-validation
      - provide-supporting-evidence

    low:
      - halt-workflow
      - escalate-to-human
      - request-additional-context

  uncertainty_signals:
    - conflicting-evidence
    - missing-context
    - novel-scenario
    - ambiguous-request
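
One possible way to turn these thresholds into a routing decision is sketched below. Two choices here are assumptions rather than stated policy: scores under the medium threshold all fall into the low band (so the 0.45 value is not used as a separate cutoff), and any uncertainty signal demotes a high score to medium. The function name `confidence_band` is hypothetical.

```python
HIGH_THRESHOLD = 0.85
MEDIUM_THRESHOLD = 0.65

ACTIONS_BY_CONFIDENCE = {
    "high": ["proceed-with-recommendation", "log-confidence-score"],
    "medium": [
        "present-alternatives",
        "request-human-validation",
        "provide-supporting-evidence",
    ],
    "low": ["halt-workflow", "escalate-to-human", "request-additional-context"],
}


def confidence_band(score: float, signals: tuple[str, ...] = ()) -> str:
    """Map a confidence score and any uncertainty signals to a band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    # Assumption: a signal such as conflicting-evidence caps the band at medium.
    if score >= HIGH_THRESHOLD and not signals:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


band = confidence_band(0.9, signals=("conflicting-evidence",))
print(band, ACTIONS_BY_CONFIDENCE[band])
```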

Safety Guardrails

Enterprise Workflow Bypass Prevention

NEVER allow agents to bypass enterprise change workflows:

# Blocked patterns
blocked_operations:
  terraform:
    - terraform apply # Must go through TFE + CI/CD
    - terraform destroy # Must go through TFE + CI/CD

  kubernetes:
    - kubectl delete namespace # Must go through GitOps
    - kubectl apply -f # Must go through GitOps (prefer manifests)

  itsm:
    - close-incident-without-resolution # Must have resolution
    - skip-change-approval # Must follow CAB process

Production Safety Rules

MUST enforce production safety:

production_rules:
  read_only_by_default: true

  allowed_read_operations:
    - get-logs
    - get-metrics
    - describe-resources
    - list-pods
    - get-incident-details

  gated_write_operations:
    - restart-pod # Requires approval
    - scale-deployment # Requires approval
    - update-configmap # Requires approval + GitOps

  blocked_operations:
    - delete-pvc # Data loss risk
    - delete-namespace # Blast radius too high
    - direct-db-writes # Must use application APIs
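
A small sketch of how `read_only_by_default` could be applied at call time follows. It assumes that any operation missing from all three lists is denied, which is one reading of the default. The function name and return strings are hypothetical.

```python
ALLOWED_READ_OPERATIONS = {
    "get-logs", "get-metrics", "describe-resources", "list-pods",
    "get-incident-details",
}
GATED_WRITE_OPERATIONS = {"restart-pod", "scale-deployment", "update-configmap"}
BLOCKED_OPERATIONS = {"delete-pvc", "delete-namespace", "direct-db-writes"}


def production_decision(operation: str, approved: bool = False) -> str:
    """Classify a production operation as allowed, needs-approval, or blocked."""
    if operation in BLOCKED_OPERATIONS:
        return "blocked"  # approval never overrides a blocked operation
    if operation in ALLOWED_READ_OPERATIONS:
        return "allowed"
    if operation in GATED_WRITE_OPERATIONS:
        return "allowed" if approved else "needs-approval"
    # Assumption: read-only by default means unknown operations are denied.
    return "blocked"


print(production_decision("get-logs"))                    # prints: allowed
print(production_decision("restart-pod"))                 # prints: needs-approval
print(production_decision("delete-pvc", approved=True))   # prints: blocked
```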

Blast Radius Assessment

MUST assess blast radius before actions:

blast_radius_assessment:
  criteria:
    - affected_users_count
    - affected_services_count
    - data_at_risk
    - revenue_impact
    - clinical_impact

  thresholds:
    low:
      max_affected_users: 100
      max_affected_services: 2
      clinical_impact: none

    medium:
      max_affected_users: 1000
      max_affected_services: 5
      clinical_impact: none

    high:
      max_affected_users: 10000
      max_affected_services: 10
      clinical_impact: possible

    critical:
      clinical_impact: any

  actions_by_blast_radius:
    low: proceed-with-logging
    medium: require-single-approval
    high: require-dual-approval
    critical: require-leadership-approval
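
The thresholds leave some interpretation open, so the sketch below states its reading explicitly. It assumes clinical impact takes one of three values (`none`, `possible`, `confirmed`), that confirmed impact is always critical, and that exceeding the high limits is also critical. These value names and the function `blast_radius` are assumptions for illustration.

```python
ACTIONS_BY_BLAST_RADIUS = {
    "low": "proceed-with-logging",
    "medium": "require-single-approval",
    "high": "require-dual-approval",
    "critical": "require-leadership-approval",
}


def blast_radius(users: int, services: int, clinical_impact: str = "none") -> str:
    """Classify blast radius from user count, service count, and clinical impact."""
    if clinical_impact not in {"none", "possible", "confirmed"}:
        raise ValueError(f"unknown clinical_impact: {clinical_impact!r}")
    # Assumption: confirmed clinical impact is critical regardless of counts.
    if clinical_impact == "confirmed":
        return "critical"
    # Assumption: anything beyond the high thresholds is treated as critical.
    if users > 10_000 or services > 10:
        return "critical"
    if clinical_impact == "possible" or users > 1_000 or services > 5:
        return "high"
    if users > 100 or services > 2:
        return "medium"
    return "low"


level = blast_radius(users=500, services=1, clinical_impact="possible")
print(level, "->", ACTIONS_BY_BLAST_RADIUS[level])  # prints: high -> require-dual-approval
```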

RAG Integration Patterns

Citation Verification

MUST verify RAG citations before execution:

rag_verification:
  required_for:
    - runbook-execution
    - configuration-changes
    - remediation-steps

  verification_steps:
    - check-source-exists
    - check-source-current # Last updated < 90 days
    - check-source-approved # In approved knowledge base
    - cross-reference-actions # Match against known procedures

  on_verification_failure:
    - flag-unverified-citation
    - request-human-validation
    - suggest-alternative-sources
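
The first three verification steps reduce to simple checks once a citation's metadata is available. A minimal sketch follows, assuming that metadata has already been fetched into a `Citation` record. The record shape and function name are hypothetical, and the cross-reference step is omitted because it depends on the procedure catalog.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_SOURCE_AGE = timedelta(days=90)  # from "Last updated < 90 days"


@dataclass
class Citation:
    uri: str
    exists: bool
    last_updated: date
    approved: bool  # True if the source is in the approved knowledge base


def failed_checks(citation: Citation, today: date) -> list[str]:
    """Return the names of verification steps this citation fails."""
    failures = []
    if not citation.exists:
        failures.append("check-source-exists")
    if today - citation.last_updated >= MAX_SOURCE_AGE:
        failures.append("check-source-current")
    if not citation.approved:
        failures.append("check-source-approved")
    return failures


stale = Citation(
    uri="https://runbooks.internal/db/failover",
    exists=True,
    last_updated=date(2024, 1, 1),
    approved=True,
)
print(failed_checks(stale, today=date(2025, 1, 31)))  # prints: ['check-source-current']
```

A non-empty result would then trigger the `on_verification_failure` actions instead of execution.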

Knowledge Base Boundaries

MUST constrain RAG to approved sources:

knowledge_sources:
  approved:
    - source: runbook-repository
      uri: 'https://runbooks.internal/*'
      trust_level: high

    - source: confluence-ops
      uri: 'https://confluence.internal/ops/*'
      trust_level: medium

    - source: servicenow-kb
      uri: 'servicenow://kb/*'
      trust_level: medium

  excluded:
    - external-forums
    - unverified-wikis
    - personal-notes

Observability Requirements

Correlation ID Propagation

MUST propagate correlation IDs across agents:

tracing:
  correlation_id:
    header: X-Correlation-ID
    format: uuid-v4
    propagate_to:
      - all-agent-calls
      - all-mcp-tool-calls
      - all-external-apis
      - all-log-entries

  span_creation:
    - workflow-start
    - phase-transition
    - agent-handoff
    - tool-invocation
    - approval-request

Audit Logging

MUST log all significant events:

audit_events:
  - event: workflow-started
    fields: [workflow_id, user_id, trigger_source]

  - event: agent-invoked
    fields: [agent_name, input_summary, correlation_id]

  - event: tool-called
    fields: [tool_name, mcp_server, risk_level, input_params]

  - event: approval-requested
    fields: [action, approver, risk_assessment]

  - event: approval-decision
    fields: [action, approver, decision, reasoning]

  - event: workflow-completed
    fields: [workflow_id, outcome, duration, actions_taken]
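
Audit events are easier to query when every record is guaranteed to carry its required fields. A minimal sketch of an emitter that enforces the field lists above is shown here. The `audit_record` function, the JSON layout, and the added `timestamp` field are assumptions.

```python
import json
from datetime import datetime, timezone

# Required fields per event, copied from the audit_events list above.
AUDIT_SCHEMA = {
    "workflow-started": ["workflow_id", "user_id", "trigger_source"],
    "agent-invoked": ["agent_name", "input_summary", "correlation_id"],
    "tool-called": ["tool_name", "mcp_server", "risk_level", "input_params"],
    "approval-requested": ["action", "approver", "risk_assessment"],
    "approval-decision": ["action", "approver", "decision", "reasoning"],
    "workflow-completed": ["workflow_id", "outcome", "duration", "actions_taken"],
}


def audit_record(event: str, **fields) -> str:
    """Build a JSON audit line, rejecting unknown events or missing fields."""
    required = AUDIT_SCHEMA.get(event)
    if required is None:
        raise ValueError(f"unknown audit event: {event!r}")
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"{event} is missing fields: {', '.join(missing)}")
    record = {
        "event": event,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    return json.dumps(record, sort_keys=True, default=str)


print(audit_record(
    "tool-called",
    tool_name="get-metrics",
    mcp_server="victoria-metrics",
    risk_level="read-only",
    input_params={"query": "up"},
))
```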

Configuration Reference

Wall-E configuration hierarchy:

# Project → Access Group → Agent → MCP → System Message → Deployment

project:
  name: platform-sre
  billing_code: PLAT-001
  owner: sre-leadership

access_groups:
  - name: sre-engineers
    permissions:
      - view-all-agents
      - use-read-only-tools
      - request-write-approvals

  - name: sre-leads
    permissions:
      - all-engineer-permissions
      - approve-tier-3-operations
      - configure-agents

agents:
  - name: incident-triage-agent
    model: gpt-4o
    temperature: 0.3
    system_message_ref: incident-triage-prompt
    mcp_servers:
      - servicenow
      - victoria-metrics

mcp_servers:
  - name: servicenow
    endpoint: https://mcp-servicenow.internal
    auth: managed-identity

  - name: victoria-metrics
    endpoint: https://mcp-victoria.internal
    auth: managed-identity

deployments:
  - environment: production
    agents: [incident-triage-agent]
    approval_tier: tier-3
