Wall-E Orchestration Patterns (Optum)
Patterns and guardrails for composing safe multi-agent workflows in Wall-E (Wide Array Large Language Engine), Optum's enterprise AI orchestration platform.
Overview
Wall-E (Wide Array Large Language Engine) is Optum's multi-agent orchestration platform. This guide covers patterns for composing safe, effective agent workflows.
Architecture Context
Wall-E consists of three layers:
┌─────────────────────────────────────────────────────────────┐
│                     CONSUMPTION METHODS                     │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐     │
│   │ Portal  │   │ Widget  │   │   API   │   │ Copilot │     │
│   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘     │
└────────┼─────────────┼─────────────┼─────────────┼──────────┘
         │             │             │             │
┌────────┴─────────────┴─────────────┴─────────────┴──────────┐
│                        BACKEND LAYER                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ FastAPI  │  │  Redis   │  │  Celery  │  │ Postgres │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
└─────────────────────────┬───────────────────────────────────┘
                          │
┌─────────────────────────┴───────────────────────────────────┐
│                         MCP SERVERS                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ServiceNow│  │  Ignis   │  │   SSH    │  │ Victoria │     │
│  └──────────┘  └──────────┘  └──────────┘  │ Metrics  │     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  └──────────┘     │
│  │ InfraRED │  │ NetScout │  │  Custom  │                   │
│  └──────────┘  └──────────┘  └──────────┘                   │
└─────────────────────────────────────────────────────────────┘
Agent Contract Pattern
MUST define an agent contract for every agent:
# agent-contract.yaml
agent:
  name: incident-triage-agent
  version: '1.0.0'
  owner: platform-sre-team
  # Domain boundaries
  domain:
    primary: 'incident-management'
    secondary: ['monitoring', 'alerting']
  # Risk classification
  risk:
    tier: 2 # 1=low, 2=medium, 3=high, 4=critical
    data_classification: internal
    phi_access: false
    pii_access: false
  # Available capabilities
  capabilities:
    tools:
      - name: get-incident-details
        mcp_server: servicenow
        risk_level: read-only
      - name: get-metrics
        mcp_server: victoria-metrics
        risk_level: read-only
      - name: add-incident-comment
        mcp_server: servicenow
        risk_level: low-risk-write
      - name: update-incident-priority
        mcp_server: servicenow
        risk_level: high-risk-write
        requires_approval: true
    resources:
      - uri_pattern: 'servicenow://incident/*'
        mcp_server: servicenow
      - uri_pattern: 'victoria://metrics/*'
        mcp_server: victoria-metrics
  # Approval requirements
  approvals:
    human_in_loop:
      - action: update-incident-priority
        condition: always
      - action: close-incident
        condition: always
      - action: execute-remediation
        condition: when_production
  # Escalation path
  escalation:
    primary: '#platform-sre-oncall'
    secondary: '[email protected]'
    pager: 'platform-sre-pd'
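A contract like this is most useful when it is checked automatically at load time. The following is a minimal sketch of such a check, not Wall-E's actual validator: `validate_contract` and the rules it enforces are hypothetical, and it assumes the contract has been parsed into a dictionary with every section nested under `agent`.

```python
# Hypothetical contract linter. The rule set is an assumption for illustration,
# and the contract sections are assumed to be nested under the "agent" key.
def validate_contract(contract: dict) -> list[str]:
    """Return a list of problems; an empty list means the contract passes."""
    problems = []
    agent = contract.get("agent", {})

    # Identity fields every contract is expected to carry
    for key in ("name", "version", "owner"):
        if not agent.get(key):
            problems.append(f"agent.{key} is required")

    tools = agent.get("capabilities", {}).get("tools", [])
    gated = {
        rule["action"]
        for rule in agent.get("approvals", {}).get("human_in_loop", [])
    }

    # High-risk writes must be flagged on the tool and listed as a gated action
    for tool in tools:
        if tool.get("risk_level") == "high-risk-write":
            if not tool.get("requires_approval"):
                problems.append(f"{tool['name']}: requires_approval must be true")
            if tool["name"] not in gated:
                problems.append(f"{tool['name']}: no human_in_loop rule")
    return problems


contract = {
    "agent": {
        "name": "incident-triage-agent",
        "version": "1.0.0",
        "owner": "platform-sre-team",
        "capabilities": {"tools": [
            {"name": "get-incident-details", "risk_level": "read-only"},
            {"name": "update-incident-priority",
             "risk_level": "high-risk-write", "requires_approval": True},
        ]},
        "approvals": {"human_in_loop": [
            {"action": "update-incident-priority", "condition": "always"},
        ]},
    }
}
print(validate_contract(contract))
```

Returning a list of problems rather than raising on the first one lets a CI job report every violation in a single pass.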
Specialist Agent Pattern
PREFER specialist agents with narrow scopes over generalists:
# ❌ BAD: Generalist agent with broad scope
agent:
  name: everything-agent
  capabilities:
    - incident-management
    - change-management
    - infrastructure
    - security
    - compliance
# ✅ GOOD: Specialist agents with focused domains
agents:
  - name: incident-triage-agent
    domain: incident-management
    expertise:
      - symptom-analysis
      - impact-assessment
      - runbook-matching
  - name: metrics-analyst-agent
    domain: observability
    expertise:
      - metric-correlation
      - anomaly-detection
      - trend-analysis
  - name: remediation-advisor-agent
    domain: incident-remediation
    expertise:
      - runbook-execution
      - safe-command-generation
      - rollback-planning
Implementation Technology
Wall-E uses pydantic-graph for workflow orchestration and pydantic-ai for agent implementation. Agents call Azure OpenAI, authenticating through Stargate.
Core Dependencies
Version Compatibility:
pydantic-graph: ^0.3.0 (for BaseNode, Graph, Edge)
pydantic-ai: ^0.0.14 (for Agent, MCPServerStreamableHTTP)
pydantic: ^2.9.0 (for BaseModel, Field)
openai: ^1.54.0 (for Azure OpenAI client)
These packages are under active development. Pin to specific versions in production and test thoroughly before upgrading.
# Wall-E Core Stack
# Package versions: see Version Compatibility above and the Wall-E deployment requirements.txt
from pydantic_graph import BaseNode, GraphRunContext, End, Graph
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.azure import AzureProvider
from pydantic import BaseModel, Field
from dataclasses import dataclass, field
State Management Pattern
MUST use dataclass-based state with separate namespaces:
from dataclasses import dataclass, field
from typing import Any

from pydantic_ai.messages import ModelMessage


@dataclass
class WorkflowState:
    """State shared across all workflow nodes."""

    # User input namespace
    user: dict = field(default_factory=dict)
    # Agent output namespace
    agent: dict = field(default_factory=dict)
    # Temporary buffer namespace
    buffer: dict = field(default_factory=dict)
    # Message history for context
    message_history: list[ModelMessage] = field(default_factory=list)

    def set_user_input(self, key: str, value: Any) -> None:
        self.user[key] = value

    def get_agent_output(self, key: str) -> Any:
        return self.agent.get(key)
Node Implementation Pattern
MUST implement nodes using BaseNode with typed return annotations:
from __future__ import annotations

from dataclasses import dataclass
from typing import Annotated

from pydantic import ValidationError
from pydantic_ai import format_as_xml
from pydantic_graph import BaseNode, Edge, End, GraphRunContext

# FieldValidationError, analyze_agent, ExecutePlan, and RequestClarification
# are assumed to be defined elsewhere in the Wall-E codebase.


class BaseWorkflowNode(BaseNode[WorkflowState]):
    """Extended BaseNode with validation and docstrings."""

    docstring_notes = True  # Include docstrings in graph visualization
    validation_schema = None  # Optional Pydantic model class for validation

    async def run(self, ctx: GraphRunContext[WorkflowState]):
        raise NotImplementedError("Subclasses MUST implement run()")

    def validate(self, data) -> list:
        """Validate data against schema if defined."""
        if self.validation_schema:
            try:
                # Pydantic v2: model_validate() replaces the deprecated validate()
                self.validation_schema.model_validate(data)
                return []
            except ValidationError as e:
                return [
                    FieldValidationError(msg=str(err["msg"]), input=str(err["input"]))
                    for err in e.errors()
                ]
        return []


@dataclass
class AnalyzeRequest(BaseWorkflowNode):
    """Analyze the incoming request and classify intent."""

    # Each possible next node carries its own Edge annotation; the union of
    # annotated types is what pydantic-graph inspects to build the graph.
    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> (
        Annotated[ExecutePlan, Edge(label="classified")]
        | Annotated[RequestClarification, Edge(label="classified")]
    ):
        # Analyze request using agent
        result = await analyze_agent.run(
            format_as_xml({"request": ctx.state.user.get("request")})
        )
        if result.data.needs_clarification:
            return RequestClarification()
        ctx.state.agent["analysis"] = result.data
        return ExecutePlan()
Workflow Composition Patterns
Diagnose → Propose → Execute → Verify Pattern
MUST structure workflows in four phases:
workflow:
  name: incident-remediation
  phases:
    - name: diagnose
      agents: [incident-triage-agent, metrics-analyst-agent]
      actions:
        - gather-incident-context
        - analyze-metrics
        - identify-affected-systems
        - correlate-recent-changes
      outputs:
        - diagnosis-summary
        - affected-components
        - probable-causes
    - name: propose
      agents: [remediation-advisor-agent]
      inputs: [diagnosis-summary, affected-components]
      actions:
        - match-runbooks
        - generate-remediation-options
        - assess-risk-per-option
        - recommend-approach
      outputs:
        - remediation-plan
        - risk-assessment
        - rollback-plan
      gates:
        - type: human-approval
          approver: on-call-engineer
          timeout: 15m
    - name: execute
      agents: [remediation-executor-agent]
      inputs: [remediation-plan, approval-token]
      preconditions:
        - approval-received
        - rollback-plan-validated
      actions:
        - execute-remediation-steps
        - monitor-execution
        - capture-outputs
      outputs:
        - execution-results
        - step-by-step-log
    - name: verify
      agents: [verification-agent, metrics-analyst-agent]
      inputs: [execution-results, original-symptoms]
      actions:
        - verify-symptoms-resolved
        - confirm-metrics-normal
        - validate-no-side-effects
      outputs:
        - verification-report
        - incident-resolution-status
Orchestrator Pattern (DocX)
Wall-E uses DocX as the primary orchestrator agent. Here is the actual implementation pattern:
from __future__ import annotations

from dataclasses import dataclass
from typing import Annotated

from pydantic_ai import format_as_xml
from pydantic_graph import Edge, Graph, GraphRunContext

# AskQuestionSchema, evaluate_agent, generate_session_id, and the remaining
# node classes (GenerateSQL, ChooseQuery, FetchData, ContextNode, EvalNode,
# RunAgain, Reprimand) are assumed to be defined elsewhere.


@dataclass
class StartNode(BaseWorkflowNode):
    """Entry point - initialize workflow state."""

    async def run(self, ctx: GraphRunContext[WorkflowState]) -> GreetingNode:
        ctx.state.user["session_id"] = generate_session_id()
        return GreetingNode()


@dataclass
class GreetingNode(BaseWorkflowNode):
    """Greet user and collect initial input."""

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> Annotated[AskQuestion, Edge(label="greeted")]:
        ctx.state.agent["greeting"] = "How can I help you today?"
        return AskQuestion()


@dataclass
class AskQuestion(BaseWorkflowNode):
    """Collect and validate user question."""

    validation_schema = AskQuestionSchema

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> (
        Annotated[Evaluate, Edge(label="question_received")]
        | Annotated[Reprimand, Edge(label="question_received")]
    ):
        question = ctx.state.user.get("question")
        errors = self.validate({"question": question})
        if errors:
            ctx.state.agent["error"] = errors
            return Reprimand()
        return Evaluate()


@dataclass
class Evaluate(BaseWorkflowNode):
    """Evaluate if question is valid and actionable."""

    async def run(
        self, ctx: GraphRunContext[WorkflowState]
    ) -> (
        Annotated[GenerateSQL, Edge(label="Good Question")]
        | Annotated[Reprimand, Edge(label="Bad Question")]
    ):
        result = await evaluate_agent.run(
            format_as_xml({"question": ctx.state.user.get("question")})
        )
        if result.data.correct:
            return GenerateSQL()
        ctx.state.agent["reprimand_reason"] = result.data.reason
        return Reprimand()


# Build the graph
docx_graph = Graph(
    nodes=[StartNode, GreetingNode, AskQuestion, Evaluate,
           GenerateSQL, ChooseQuery, FetchData, ContextNode,
           EvalNode, RunAgain, Reprimand],
    state_type=WorkflowState,
)
Graph Execution Pattern
MUST execute graphs with proper error handling:
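import logging

from pydantic_ai.messages import ModelMessage
from pydantic_graph import End

logger = logging.getLogger(__name__)


async def run_workflow(user_input: str, history: list[ModelMessage] | None = None):
    """Execute a Wall-E workflow with state management."""
    # Initialize state
    state = WorkflowState()
    state.user["request"] = user_input
    if history:
        state.message_history = history

    try:
        # Run graph; iterating the run yields each node as it is reached
        async with docx_graph.iter(
            StartNode(),
            state=state,
            deps=WorkflowDeps(),
        ) as graph_run:
            previous = None
            async for node in graph_run:
                current = type(node).__name__
                # Log node transitions for observability
                if previous is not None:
                    logger.info("Transition: %s → %s", previous, current)
                previous = current
                # Check for terminal states
                if isinstance(node, End):
                    break
    except Exception:
        # Record the failure with its traceback, then let the caller decide
        logger.exception("Workflow failed for request")
        raise
    return state.agent.get("result")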
MCP Integration Pattern
MUST proxy MCP requests through authenticated endpoints:
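from fastapi import APIRouter, HTTPException, Request, Response
from httpx import AsyncClient

router = APIRouter()

# httpx has already decoded the upstream body, so these headers no longer
# describe the content being returned and must not be copied back verbatim.
EXCLUDED_RESPONSE_HEADERS = {
    "connection", "content-encoding", "content-length", "transfer-encoding",
}


@router.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE"])
async def proxy_mcp_request(request: Request, path: str):
    """Proxy requests to MCP servers with authentication."""
    # Get auth token from request; reject unauthenticated calls
    auth_header = request.headers.get("Authorization")
    if not auth_header:
        raise HTTPException(status_code=401, detail="Missing Authorization header")

    # Build target URL (MCP_SERVER_BASE_URL comes from application settings)
    target_url = f"{MCP_SERVER_BASE_URL}/{path}"

    # Forward request with auth, preserving the body's content type
    headers = {"Authorization": auth_header}
    if content_type := request.headers.get("Content-Type"):
        headers["Content-Type"] = content_type

    async with AsyncClient() as client:
        response = await client.request(
            method=request.method,
            url=target_url,
            headers=headers,
            content=await request.body(),
            params=request.query_params.multi_items(),
        )

    return Response(
        content=response.content,
        status_code=response.status_code,
        headers={
            name: value
            for name, value in response.headers.items()
            if name.lower() not in EXCLUDED_RESPONSE_HEADERS
        },
    )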
MCP Server Implementation Pattern
FastMCP Server Pattern
MUST implement MCP servers using FastMCP:
from fastmcp import FastMCP

instructions = """
This MCP server provides tools for querying ServiceNow incidents.
Available tools:
- fetch_incidents: Retrieve incidents by site code
"""

mcp = FastMCP(
    name="ServiceNow MCP Server",
    version="1.0.0",
    instructions=instructions,
)


@mcp.tool()
def fetch_incidents(site_code: str | None = None) -> list[dict]:
    """
    Fetch active incidents from ServiceNow.

    Args:
        site_code: Optional site code to filter incidents

    Returns:
        List of incident records with number, priority, description
    """
    # build_servicenow_query and servicenow_client are assumed project helpers
    query = build_servicenow_query(site_code)
    return servicenow_client.query("incident", query)


def main():
    mcp.run(transport="http", host="0.0.0.0", port=3001)


if __name__ == "__main__":
    main()
MCP Client Pattern
MUST connect to MCP servers using pydantic-ai integration:
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStreamableHTTP
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.azure import AzureProvider


async def get_agent_with_mcp(mcp_server_url: str, system_prompt: str) -> Agent:
    """Create an agent connected to an MCP server."""
    # Get Azure OpenAI client via Stargate (get_openai_client is a project helper)
    openai_client = await get_openai_client()
    model = OpenAIModel("gpt-4o", provider=AzureProvider(openai_client=openai_client))

    # Configure MCP server connection
    mcp_server = MCPServerStreamableHTTP(
        url=mcp_server_url,
        sse_read_timeout=300,  # 5 minute timeout for long operations
    )

    return Agent(
        model=model,
        system_prompt=system_prompt,
        mcp_servers=[mcp_server],
    )


# Usage
async def main():
    agent = await get_agent_with_mcp(
        mcp_server_url="http://localhost:3001/mcp",
        system_prompt="You are a helpful incident analyst.",
    )
    async with agent.run_mcp_servers():
        result = await agent.run("What incidents are open for site ABC123?")
        print(result.output)


if __name__ == "__main__":
    asyncio.run(main())
Agent Configuration Pattern
Database-Driven Agent Definition
Wall-E stores agent configurations in PostgreSQL:
from django.db import models
from pydantic import BaseModel, Field


class Agent(models.Model):
    """Agent configuration stored in database."""

    name = models.CharField(max_length=20, unique=True)
    foundation_model = models.ForeignKey("FoundationModel", on_delete=models.DO_NOTHING)
    system_message = models.ForeignKey("SystemMessage", on_delete=models.DO_NOTHING)
    retries = models.IntegerField(default=2)
    output_schema = models.TextField()  # Pydantic schema as Python code
    pipe = models.TextField(default="data = output.advice")  # Output transformation
    tools = models.ManyToManyField("Tool", blank=True)

    async def initialize(self):
        """Initialize a WalleAgent from database config."""
        from chat.walle import WalleAgent

        tools = [await tool.execute() async for tool in self.tools.all()]
        agent = WalleAgent(
            result_type=await self.validate_output_schema(),
            system_prompt=await self.get_system_message(),
            retries=self.retries,
            tools=tools if tools else None,
        )
        await agent.initialize_model(agent.model)
        return agent

    async def validate_output_schema(self):
        """Dynamically validate and return Pydantic model from code.

        SECURITY WARNING: This method uses exec() to execute user-provided code,
        which presents a code injection risk. In production:
        1. ONLY execute schemas from trusted sources (database with access controls)
        2. NEVER accept schemas from user input or external APIs
        3. Consider using a declarative format (JSON/YAML) instead of Python code
        4. Implement schema validation/sanitization before exec()
        5. Run in a sandboxed environment with restricted permissions
        """
        namespace = {"BaseModel": BaseModel, "Field": Field}
        local_ns = {}
        # SECURITY: Only execute trusted schemas from controlled sources
        exec(self.output_schema, namespace, local_ns)

        # Find the Pydantic model in executed code. The local name avoids
        # shadowing the django.db.models import.
        schema_models = {
            name: obj for name, obj in local_ns.items()
            if isinstance(obj, type) and issubclass(obj, BaseModel)
        }
        if len(schema_models) != 1:
            raise ValueError("Output schema MUST define exactly one Pydantic model")
        return next(iter(schema_models.values()))
WalleAgent Implementation
MUST implement agents with Stargate authentication and retry logic:
import os

import httpx
import openai
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.azure import AzureProvider

# Validate required environment variables at module load time
REQUIRED_ENV_VARS = [
    "TOKEN_URL",
    "SCOPE",
    "CLIENT_ID",
    "CLIENT_SECRET",
    "AZURE_ENDPOINT",
    "API_VERSION",
    "AZURE_DEPLOYMENT",
    "PROJECT_ID",
    "X_UPSTREAM_ENV",
]
missing_vars = [var for var in REQUIRED_ENV_VARS if not os.getenv(var)]
if missing_vars:
    raise EnvironmentError(
        f"Missing required environment variables: {', '.join(missing_vars)}. "
        f"Set these before starting the application."
    )
class WalleAgent:
    """Wall-E agent wrapper with Azure OpenAI integration."""

    def __init__(
        self,
        result_type: type = str,
        system_prompt: str = "",
        retries: int = 2,
        tools: list | None = None,
        deps_type: type = type(None),  # NoneType means the agent takes no deps
    ):
        self.result_type = result_type
        self.system_prompt = system_prompt
        self.retries = retries
        self.tools = tools or []
        self.deps_type = deps_type
        self.model = "gpt-4o"
        self._agent = None
        self._openai_client = None

    async def initialize_model(self, model_name: str):
        """Initialize Azure OpenAI client via Stargate."""
        access_token = await self._get_stargate_token()
        self._openai_client = openai.AsyncAzureOpenAI(
            azure_endpoint=os.getenv("AZURE_ENDPOINT"),
            api_version=os.getenv("API_VERSION"),
            azure_deployment=os.getenv("AZURE_DEPLOYMENT"),
            azure_ad_token=access_token,
            default_headers={
                "projectId": os.getenv("PROJECT_ID"),
                "x-upstream-env": os.getenv("X_UPSTREAM_ENV"),
            },
        )
        model = OpenAIModel(
            model_name,
            provider=AzureProvider(openai_client=self._openai_client),
        )
        self._agent = Agent(
            model=model,
            result_type=self.result_type,
            system_prompt=self.system_prompt,
            retries=self.retries,
            tools=self.tools,
            deps_type=self.deps_type,
        )

    async def run(self, prompt: str, history: list | None = None, deps=None):
        """Run agent with automatic token refresh on auth failures."""
        try:
            return await self._agent.run(
                prompt,
                message_history=history,
                deps=deps,
            )
        except openai.AuthenticationError:
            # Re-authenticate and retry once; a second failure propagates
            await self.initialize_model(self.model)
            return await self._agent.run(
                prompt,
                message_history=history,
                deps=deps,
            )

    async def _get_stargate_token(self) -> str:
        """Get OAuth2 token from Stargate.

        Environment variables are validated at module load time,
        so this method can safely access them without additional checks.
        """
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                os.getenv("TOKEN_URL"),
                headers={"Content-Type": "application/x-www-form-urlencoded"},
                data={
                    "grant_type": "client_credentials",
                    "scope": os.getenv("SCOPE"),
                    "client_id": os.getenv("CLIENT_ID"),
                    "client_secret": os.getenv("CLIENT_SECRET"),
                },
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()["access_token"]
MCP Integration Patterns
Resource vs Tool Separation
MUST separate context (resources) from actions (tools):
# Resources = Read-only context
resources:
  - name: incident-context
    uri: servicenow://incident/{id}
    use_for: 'providing incident details to agents'
    updates: on-demand
  - name: metrics-context
    uri: victoria://metrics/{query}
    use_for: 'providing metric data for analysis'
    updates: real-time
# Tools = Actions with side effects
tools:
  - name: add-comment
    type: action
    side_effects: true
    risk: low-risk-write
  - name: execute-command
    type: action
    side_effects: true
    risk: high-risk-write
    approval_required: true
Tool Risk Mapping
MUST map tools to Wall-E policy tiers:
policy_tiers:
  tier_1_allow:
    description: 'Auto-approved, logged'
    tools:
      - get-incident-details
      - get-metrics
      - list-pods
      - describe-deployment
  tier_2_allow_with_audit:
    description: 'Auto-approved, enhanced audit'
    tools:
      - add-incident-comment
      - create-change-request
      - tag-resource
  tier_3_require_approval:
    description: 'Human approval required'
    tools:
      - update-incident-priority
      - close-incident
      - scale-deployment
  tier_4_require_dual_approval:
    description: 'Two approvers required'
    tools:
      - execute-ssh-command
      - modify-infrastructure
      - delete-resource
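At call time the tier table becomes a lookup from tool name to the number of approvers required. The following sketch shows one way to do that; `tier_for`, `approvals_needed`, and the choice to treat unlisted tools as tier 4 (fail closed) are assumptions for illustration, not documented Wall-E behavior.

```python
# Tier membership copied from the policy_tiers table above
POLICY_TIERS = {
    "tier_1_allow": ["get-incident-details", "get-metrics", "list-pods",
                     "describe-deployment"],
    "tier_2_allow_with_audit": ["add-incident-comment", "create-change-request",
                                "tag-resource"],
    "tier_3_require_approval": ["update-incident-priority", "close-incident",
                                "scale-deployment"],
    "tier_4_require_dual_approval": ["execute-ssh-command", "modify-infrastructure",
                                     "delete-resource"],
}

# Approver counts implied by each tier's description
APPROVERS_REQUIRED = {
    "tier_1_allow": 0,
    "tier_2_allow_with_audit": 0,
    "tier_3_require_approval": 1,
    "tier_4_require_dual_approval": 2,
}

# Assumption: a tool missing from the table gets the strictest tier
DEFAULT_TIER = "tier_4_require_dual_approval"

# Invert the table once so lookups are O(1)
_TOOL_TO_TIER = {tool: tier for tier, tools in POLICY_TIERS.items() for tool in tools}


def tier_for(tool_name: str) -> str:
    """Return the policy tier for a tool, failing closed for unknown tools."""
    return _TOOL_TO_TIER.get(tool_name, DEFAULT_TIER)


def approvals_needed(tool_name: str) -> int:
    """Return how many human approvers the tool's tier requires."""
    return APPROVERS_REQUIRED[tier_for(tool_name)]


print(tier_for("scale-deployment"), approvals_needed("scale-deployment"))
```

Failing closed means a newly added tool cannot run unapproved just because nobody has classified it yet.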
Human-in-the-Loop Patterns
Approval Gate Pattern
MUST implement approval gates for high-risk operations:
approval_gate:
  name: remediation-approval
  trigger: before-execute-phase
  request:
    format: structured
    fields:
      - action-summary
      - risk-assessment
      - affected-systems
      - rollback-plan
      - estimated-duration
  approval_options:
    - approve
    - approve-with-modifications
    - reject
    - escalate
  timeout:
    duration: 15m
    action: escalate-to-secondary
  audit:
    log_request: true
    log_decision: true
    log_approver: true
    log_reasoning: true
Uncertainty Surfacing Pattern
MUST surface uncertainty explicitly:
uncertainty_handling:
  confidence_thresholds:
    high: 0.85
    medium: 0.65
    low: 0.45
  actions_by_confidence:
    high:
      - proceed-with-recommendation
      - log-confidence-score
    medium:
      - present-alternatives
      - request-human-validation
      - provide-supporting-evidence
    low:
      - halt-workflow
      - escalate-to-human
      - request-additional-context
  uncertainty_signals:
    - conflicting-evidence
    - missing-context
    - novel-scenario
    - ambiguous-request
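The thresholds translate into a simple banding function. This sketch makes one interpretive assumption that the configuration does not spell out: each threshold is a lower bound, and any score below the medium threshold, including scores under 0.45, is handled as low confidence. The function names are hypothetical.

```python
# Thresholds and actions copied from the uncertainty_handling config above
HIGH_THRESHOLD = 0.85
MEDIUM_THRESHOLD = 0.65

ACTIONS_BY_CONFIDENCE = {
    "high": ["proceed-with-recommendation", "log-confidence-score"],
    "medium": ["present-alternatives", "request-human-validation",
               "provide-supporting-evidence"],
    "low": ["halt-workflow", "escalate-to-human", "request-additional-context"],
}


def confidence_band(score: float) -> str:
    """Map a confidence score in [0, 1] to a band name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be within [0, 1], got {score}")
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    # Assumption: everything below the medium threshold halts the workflow
    return "low"


def actions_for(score: float) -> list[str]:
    """Return the configured actions for a confidence score."""
    return ACTIONS_BY_CONFIDENCE[confidence_band(score)]


print(confidence_band(0.72), actions_for(0.72))
```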
Safety Guardrails
Enterprise Workflow Bypass Prevention
NEVER allow agents to bypass enterprise change workflows:
# Blocked patterns
blocked_operations:
  terraform:
    - terraform apply # Must go through TFE + CI/CD
    - terraform destroy # Must go through TFE + CI/CD
  kubernetes:
    - kubectl delete namespace # Must go through GitOps
    - kubectl apply -f # Must go through GitOps (prefer manifests)
  itsm:
    - close-incident-without-resolution # Must have resolution
    - skip-change-approval # Must follow CAB process
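For the command-line entries, a pre-flight check can reject a proposed command before it reaches a shell. The sketch below is a hypothetical illustration that compares the leading tokens of a command against the blocked patterns; `is_blocked` is an assumed name, and the matching is deliberately simple.

```python
import shlex

# Command prefixes taken from the terraform and kubernetes entries above
BLOCKED_PREFIXES = [
    ("terraform", "apply"),
    ("terraform", "destroy"),
    ("kubectl", "delete", "namespace"),
    ("kubectl", "apply", "-f"),
]


def is_blocked(command: str) -> bool:
    """Return True when the command starts with a blocked token sequence."""
    tokens = tuple(shlex.split(command))
    return any(tokens[: len(prefix)] == prefix for prefix in BLOCKED_PREFIXES)


for candidate in ("terraform plan", "terraform apply -auto-approve"):
    print(candidate, "->", "blocked" if is_blocked(candidate) else "allowed")
```

Prefix matching is easy to evade, for example by placing a global flag before the subcommand, so a check like this is a convenience for catching honest mistakes. Enforcement still belongs in tool definitions and credentials that cannot perform the blocked operations at all.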
Production Safety Rules
MUST enforce production safety:
production_rules:
  read_only_by_default: true
  allowed_read_operations:
    - get-logs
    - get-metrics
    - describe-resources
    - list-pods
    - get-incident-details
  gated_write_operations:
    - restart-pod # Requires approval
    - scale-deployment # Requires approval
    - update-configmap # Requires approval + GitOps
  blocked_operations:
    - delete-pvc # Data loss risk
    - delete-namespace # Blast radius too high
    - direct-db-writes # Must use application APIs
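These rules reduce to a three-way decision per operation. The sketch below shows that decision under one assumption: `read_only_by_default` is taken to mean that an operation on none of the lists is denied. `decide_production` and the decision labels are hypothetical names.

```python
# Operation lists copied from production_rules above
ALLOWED_READS = {"get-logs", "get-metrics", "describe-resources", "list-pods",
                 "get-incident-details"}
GATED_WRITES = {"restart-pod", "scale-deployment", "update-configmap"}
BLOCKED = {"delete-pvc", "delete-namespace", "direct-db-writes"}


def decide_production(operation: str) -> str:
    """Return 'allow', 'require-approval', or 'deny' for a production operation."""
    # Check the block list first so it wins over any accidental overlap
    if operation in BLOCKED:
        return "deny"
    if operation in ALLOWED_READS:
        return "allow"
    if operation in GATED_WRITES:
        return "require-approval"
    # Assumption: read-only by default means unlisted operations are denied
    return "deny"


for op in ("get-logs", "restart-pod", "delete-pvc", "drop-table"):
    print(op, "->", decide_production(op))
```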
Blast Radius Assessment
MUST assess blast radius before actions:
blast_radius_assessment:
  criteria:
    - affected_users_count
    - affected_services_count
    - data_at_risk
    - revenue_impact
    - clinical_impact
  thresholds:
    low:
      max_affected_users: 100
      max_affected_services: 2
      clinical_impact: none
    medium:
      max_affected_users: 1000
      max_affected_services: 5
      clinical_impact: none
    high:
      max_affected_users: 10000
      max_affected_services: 10
      clinical_impact: possible
    critical:
      clinical_impact: any
  actions_by_blast_radius:
    low: proceed-with-logging
    medium: require-single-approval
    high: require-dual-approval
    critical: require-leadership-approval
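The thresholds can be read as a cascade from least to most severe. The sketch below is one possible reading, not the platform's definition: it assumes `clinical_impact` takes the values `none`, `possible`, or `confirmed`, that a confirmed clinical impact is always critical, and that anything exceeding the high limits is also critical. It covers only the user, service, and clinical criteria.

```python
# Approval actions copied from actions_by_blast_radius above
ACTIONS_BY_BLAST_RADIUS = {
    "low": "proceed-with-logging",
    "medium": "require-single-approval",
    "high": "require-dual-approval",
    "critical": "require-leadership-approval",
}

# Assumed value set for clinical impact; the source names only none/possible/any
CLINICAL_LEVELS = ("none", "possible", "confirmed")


def blast_radius(users: int, services: int, clinical: str) -> str:
    """Classify an action's blast radius from the threshold table."""
    if clinical not in CLINICAL_LEVELS:
        raise ValueError(f"unknown clinical impact: {clinical}")
    if clinical == "confirmed":
        return "critical"
    # Low and medium both require that no clinical impact is expected
    if clinical == "none":
        if users <= 100 and services <= 2:
            return "low"
        if users <= 1000 and services <= 5:
            return "medium"
    if users <= 10000 and services <= 10:
        return "high"
    # Assumption: exceeding the high limits escalates to critical
    return "critical"


level = blast_radius(users=500, services=3, clinical="none")
print(level, "->", ACTIONS_BY_BLAST_RADIUS[level])
```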
RAG Integration Patterns
Citation Verification
MUST verify RAG citations before execution:
rag_verification:
  required_for:
    - runbook-execution
    - configuration-changes
    - remediation-steps
  verification_steps:
    - check-source-exists
    - check-source-current # Last updated < 90 days
    - check-source-approved # In approved knowledge base
    - cross-reference-actions # Match against known procedures
  on_verification_failure:
    - flag-unverified-citation
    - request-human-validation
    - suggest-alternative-sources
Knowledge Base Boundaries
MUST constrain RAG to approved sources:
knowledge_sources:
  approved:
    - source: runbook-repository
      uri: 'https://runbooks.internal/*'
      trust_level: high
    - source: confluence-ops
      uri: 'https://confluence.internal/ops/*'
      trust_level: medium
    - source: servicenow-kb
      uri: 'servicenow://kb/*'
      trust_level: medium
  excluded:
    - external-forums
    - unverified-wikis
    - personal-notes
Observability Requirements
Correlation ID Propagation
MUST propagate correlation IDs across agents:
tracing:
  correlation_id:
    header: X-Correlation-ID
    format: uuid-v4
    propagate_to:
      - all-agent-calls
      - all-mcp-tool-calls
      - all-external-apis
      - all-log-entries
  span_creation:
    - workflow-start
    - phase-transition
    - agent-handoff
    - tool-invocation
    - approval-request
Audit Logging
MUST log all significant events:
audit_events:
  - event: workflow-started
    fields: [workflow_id, user_id, trigger_source]
  - event: agent-invoked
    fields: [agent_name, input_summary, correlation_id]
  - event: tool-called
    fields: [tool_name, mcp_server, risk_level, input_params]
  - event: approval-requested
    fields: [action, approver, risk_assessment]
  - event: approval-decision
    fields: [action, approver, decision, reasoning]
  - event: workflow-completed
    fields: [workflow_id, outcome, duration, actions_taken]
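Treating the event list as a schema lets the logger reject incomplete audit records instead of silently writing them. The sketch below is a minimal illustration: `build_audit_event` is a hypothetical helper, and the added `timestamp` field is an assumption that does not appear in the event definitions.

```python
import json
from datetime import datetime, timezone

# Required fields per event, copied from audit_events above
AUDIT_SCHEMA = {
    "workflow-started": ["workflow_id", "user_id", "trigger_source"],
    "agent-invoked": ["agent_name", "input_summary", "correlation_id"],
    "tool-called": ["tool_name", "mcp_server", "risk_level", "input_params"],
    "approval-requested": ["action", "approver", "risk_assessment"],
    "approval-decision": ["action", "approver", "decision", "reasoning"],
    "workflow-completed": ["workflow_id", "outcome", "duration", "actions_taken"],
}


def build_audit_event(event: str, **fields) -> str:
    """Return a JSON audit record, raising if the event or its fields are invalid."""
    if event not in AUDIT_SCHEMA:
        raise ValueError(f"unknown audit event: {event}")
    missing = [name for name in AUDIT_SCHEMA[event] if name not in fields]
    if missing:
        raise ValueError(f"{event} is missing fields: {', '.join(missing)}")
    record = {
        "event": event,
        # Assumption: records carry a UTC timestamp in ISO 8601 format
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    return json.dumps(record, sort_keys=True)


print(build_audit_event(
    "tool-called",
    tool_name="get-metrics",
    mcp_server="victoria-metrics",
    risk_level="read-only",
    input_params={"query": "up"},
))
```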
Configuration Reference
Wall-E configuration hierarchy:
# Project → Access Group → Agent → MCP → System Message → Deployment
project:
  name: platform-sre
  billing_code: PLAT-001
  owner: sre-leadership
access_groups:
  - name: sre-engineers
    permissions:
      - view-all-agents
      - use-read-only-tools
      - request-write-approvals
  - name: sre-leads
    permissions:
      - all-engineer-permissions
      - approve-tier-3-operations
      - configure-agents
agents:
  - name: incident-triage-agent
    model: gpt-4o
    temperature: 0.3
    system_message_ref: incident-triage-prompt
    mcp_servers:
      - servicenow
      - victoria-metrics
mcp_servers:
  - name: servicenow
    endpoint: https://mcp-servicenow.internal
    auth: managed-identity
  - name: victoria-metrics
    endpoint: https://mcp-victoria.internal
    auth: managed-identity
deployments:
  - environment: production
    agents: [incident-triage-agent]
    approval_tier: tier-3
Related Assets
Wall-E Agent Composition Helper
Compose multiple specialized agents into a safe Wall-E workflow with proper MCP tool assignments, guardrails, and human-in-loop gates.
Owner: epic-platform-sre
Wall-E Workflow Designer (Optum)
Assist with designing, reviewing, and optimizing multi-agent Wall-E workflows and MCP integrations following Optum enterprise patterns.
Owner: epic-platform-sre
Wall-E RAG Tuning Helper
Recommend RAG chunking, embedding, and retrieval parameters for Wall-E contexts based on corpus characteristics and performance requirements.
Owner: epic-platform-sre
MCP Server Development Standards (Optum)
Standards, patterns, and guardrails for building Model Context Protocol (MCP) servers compatible with Wall-E, VS Code Copilot, and enterprise systems.
Owner: epic-platform-sre
drzero-swarm
Distribute work across multiple domain specialist agents in parallel for complex multi-domain tasks
Owner: epic-platform-sre