The Kubernetes mutating webhook is configured with `failurePolicy: Ignore`. If the operator is down or the webhook call exceeds its 10-second timeout, the API server admits the Pod unchanged, so it is created normally without instrumentation. No telemetry is lost, because none was ever injected.
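The fail-open behavior described above can be modeled in a few lines. This is a minimal sketch of the semantics, not Kubernetes client code: `admit`, `inject_instrumentation`, and `operator_down` are hypothetical names, and the timeout is represented simply by the webhook raising `TimeoutError`.

```python
from typing import Any, Callable, Dict

Pod = Dict[str, Any]


def admit(pod: Pod, webhook: Callable[[Pod], Pod], failure_policy: str = "Ignore") -> Pod:
    """Model how a mutating webhook failure is handled under a given failurePolicy."""
    try:
        return webhook(pod)
    except Exception:
        if failure_policy == "Ignore":
            # Fail open: the Pod is admitted exactly as submitted.
            return pod
        # Any other policy (e.g. "Fail") rejects the request.
        raise


def inject_instrumentation(pod: Pod) -> Pod:
    """Hypothetical mutation: mark the Pod as instrumented."""
    return {**pod, "instrumented": True}


def operator_down(pod: Pod) -> Pod:
    """Hypothetical stand-in for an unreachable or slow webhook."""
    raise TimeoutError("webhook did not respond within 10s")
```

With `failure_policy="Ignore"`, `admit(pod, operator_down)` returns the Pod untouched, while the same call with `failure_policy="Fail"` propagates the error and the Pod is never created.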
flowchart LR
    subgraph Input["Signal Sources"]
        ERR["Error bursts"]
        LAT["Latency shifts"]
        DEP["Deploy events"]
    end
    subgraph Engine["Incident Engine"]
        CORR["Correlator"]
        SCORE["Scorer"]
        GROUP["Grouper"]
    end
    subgraph Output["Actions"]
        INC["Create incident"]
        NOTIFY["Send notification"]
        RCA["Trigger RCA"]
        PR["Open PR"]
    end
    subgraph Failures["Failure Isolation"]
        F_CORR["Correlator crash\n→ restart from offset\n→ no duplicate incidents"]
        F_NOTIFY["Notification fail\n→ retry 3x\n→ incident still created"]
        F_RCA["RCA timeout\n→ incident exists\n→ RCA queued"]
        F_PR["PR creation fail\n→ incident exists\n→ manual remediation"]
    end
    Input --> CORR --> SCORE --> GROUP
    GROUP --> INC
    INC --> NOTIFY
    INC --> RCA
    RCA --> PR
    CORR -.-> F_CORR
    NOTIFY -.-> F_NOTIFY
    RCA -.-> F_RCA
    PR -.-> F_PR
Each stage can fail independently of the others. A notification failure does not block incident creation. An RCA timeout does not prevent the incident from being visible. A PR creation failure does not affect the incident record.
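One way to get this isolation is to persist the incident first and wrap each downstream action in its own error handling. The sketch below is illustrative only: `Incident`, `handle_group`, the callables, and the in-memory `store` are all hypothetical, and "retry 3x" is assumed to mean three attempts in total.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    id: str
    notified: bool = False
    rca_status: str = "pending"    # "done" or "queued"
    remediation: str = "pending"   # "pr_opened" or "manual"
    log: List[str] = field(default_factory=list)


def handle_group(
    incident_id: str,
    notify: Callable[[Incident], None],
    run_rca: Callable[[Incident], None],
    open_pr: Callable[[Incident], None],
    store: Dict[str, Incident],
    max_attempts: int = 3,  # assumption: three attempts in total
) -> Incident:
    # The incident record is written before any downstream action runs,
    # so no later failure can prevent it from existing.
    incident = Incident(incident_id)
    store[incident_id] = incident

    # Notification: bounded retries, failures are logged and swallowed.
    for attempt in range(1, max_attempts + 1):
        try:
            notify(incident)
            incident.notified = True
            break
        except Exception as exc:
            incident.log.append(f"notify attempt {attempt} failed: {exc}")

    # RCA: on timeout, mark it queued and stop; the PR step depends on RCA.
    try:
        run_rca(incident)
        incident.rca_status = "done"
    except TimeoutError:
        incident.rca_status = "queued"
        return incident

    # PR creation: on failure, fall back to manual remediation.
    try:
        open_pr(incident)
        incident.remediation = "pr_opened"
    except Exception as exc:
        incident.log.append(f"PR creation failed: {exc}")
        incident.remediation = "manual"
    return incident
```

In this sketch, even if every downstream callable fails, `store` still contains the incident, which is the property the section describes.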