Rebecka Raj
PM Portfolio - Project 04
AI-Native WAN
Observability
How I closed the 45-minute gap between alert and routing action at Fidelity: building L3-L7 observability for 40+ services, then adding an AI recommendation layer that applies the same anomaly-detection pattern I designed for LLM traffic routing at Modguard.
Context
Fidelity + Modguard
Role
Senior PM, Network Infra
Services Monitored
40+
Uptime Improvement
18% YoY
Domain
WAN Observability / AI
Period
2022 - 2024
The Observability Gap
At Fidelity in 2022, we had dashboards. What we did not have was decisions. The NOC team received alerts from 40+ services, opened Grafana, traced the dependency chain manually, formed a routing hypothesis, cross-referenced BGP state in a separate CLI session, validated the fix, and then acted. Average time from alert to routing action: 45 minutes.
In a trading environment, 45 minutes of degraded performance on a critical service is not a monitoring problem — it is a business continuity problem. The root issue was not lack of data. It was the absence of a layer between telemetry and action that could compress interpretation time from 45 minutes to 90 seconds.
45 minutes
alert-to-routing-action (before)
4 minutes
after WAN Copilot recommendation layer
L3 - Routing Layer
BGP session state, SD-WAN path health, MPLS circuit utilization. Route-level anomalies that precede application-layer symptoms by 8-15 minutes — the early warning layer.
4 services monitored
L4 - Transport Layer
TCP connection health, load balancer throughput, packet loss across WAN links. The cascade layer: L3 anomalies manifest here before their impact reaches L7.
5 services monitored
L7 - Application Layer
API gateway latency, auth service response times, SaaS connector health. Customer-visible impact layer. By the time L7 degrades, L3 and L4 have been signalling for minutes.
3 services monitored
AI Recommendation Layer
Cross-layer correlation, root cause attribution, blast radius calculation, BGP policy recommendation with confidence scoring. The decision layer that was missing.
Pattern from Modguard applied
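That 8-15 minute lead time between layers is what makes an AI layer tractable: a lower-layer anomaly that fires inside the lead window before a higher-layer symptom is a root-cause candidate. A minimal sketch of the correlation idea in Python; the field names and the 15-minute window are illustrative, not the production schema.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Anomaly:
        layer: str         # "L3", "L4", or "L7"
        service: str       # e.g. "api-gateway"
        signal: str        # e.g. "bgp_med_change", "packet_loss", "latency_spike"
        fired_at: datetime

    LAYER_ORDER = {"L3": 0, "L4": 1, "L7": 2}

    def root_cause_candidates(symptom: Anomaly, history: list[Anomaly],
                              lead: timedelta = timedelta(minutes=15)) -> list[Anomaly]:
        # A lower-layer anomaly that fired within the lead window before
        # the symptom is a candidate root cause, returned earliest-first.
        hits = [a for a in history
                if LAYER_ORDER[a.layer] < LAYER_ORDER[symptom.layer]
                and timedelta(0) <= symptom.fired_at - a.fired_at <= lead]
        return sorted(hits, key=lambda a: a.fired_at)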
The Pattern from Modguard
At Modguard, I shaped an enterprise AI traffic platform, a SASE analogue for LLM workloads: multi-tenant routing with allow/deny rules, rate limiting, audit logging, anomaly detection, and failback validation. When I returned to the WAN observability problem at Fidelity, I recognized that the recommendation layer was structurally identical. Whether you are routing packets or LLM requests, the decision-support pattern is the same.
Modguard Pattern (LLM Traffic)
1. Detect anomalous LLM request pattern (latency spike, error rate)
2. Identify root cause: model degradation, rate limit, upstream failure
3. Calculate blast radius: which tenants affected, which workflows impacted
4. Recommend failback: route to backup model endpoint with confidence score
5. Audit log every recommendation and operator decision

WAN Copilot (Network Traffic)
1. Detect anomalous network telemetry (latency spike, packet loss, BGP MED change)
2. Identify root cause: ISP congestion, BGP path degradation, cascade from upstream service
3. Calculate blast radius: which services affected, which user populations impacted
4. Recommend routing action: BGP LOCAL_PREF adjustment with specific prefix and value
5. Audit every recommendation and operator accept/dismiss decision
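The structural identity is easy to make concrete. A hedged sketch of the shared skeleton in Python, with invented type and function names; each domain plugs its own detect, attribute, and recommend logic into the same five steps.

    from dataclasses import dataclass
    from typing import Any, Callable, Optional

    @dataclass
    class Recommendation:
        root_cause: str
        blast_radius: list[str]  # affected tenants (Modguard) or services (WAN Copilot)
        action: str              # e.g. a failback endpoint or a BGP LOCAL_PREF change
        confidence: float        # surfaced so the operator can judge the call quickly

    def run_pipeline(telemetry: Any,
                     detect: Callable[[Any], Optional[Any]],
                     attribute: Callable[[Any], str],
                     radius: Callable[[Any], list[str]],
                     recommend: Callable[[Any], tuple[str, float]],
                     audit: Callable[[Recommendation], None]) -> Optional[Recommendation]:
        # Steps 1-5: detect, attribute the root cause, size the blast
        # radius, recommend with confidence, audit before the operator acts.
        anomaly = detect(telemetry)
        if anomaly is None:
            return None
        action, confidence = recommend(anomaly)
        rec = Recommendation(attribute(anomaly), radius(anomaly), action, confidence)
        audit(rec)
        return rec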
The PM Insight
The columns are different. The structure is identical. A PM who has built decision-support tooling for AI traffic routing at Modguard has already solved the hardest design problem in network observability: how do you surface a recommendation with enough reasoning transparency that a human expert will trust and act on it in under 90 seconds?
WAN Copilot - Live Demo
Three active anomalies from a real degradation event at Fidelity (anonymised). Click into any anomaly to see the full root cause chain, blast radius, and BGP recommendation. Accept to simulate policy propagation.
WAN Copilot
Fidelity Enterprise WAN - 40+ services monitored
3 anomalies - Live
Anomaly Detection / Live Telemetry / SLA Risk
WAN Copilot detected 3 anomalies in the last 35 minutes. Root causes, blast radius, and routing recommendations are ready for review. Accept or dismiss each recommendation below.
High - API Gateway - Latency Spike + Packet Loss
4m ago
API Gateway latency has risen from baseline 22ms to 145ms over 18 minutes. Packet loss elevated at 1.2%. SLA breach projected within 11 days at current trajectory.
Medium - Database Proxy - Sustained Latency Elevation
22m ago
Database Proxy latency at 71ms vs 12ms baseline. Packet loss 0.8%. Uptime trending toward SLA threshold at current rate.
Low - SD-WAN Fabric West - Throughput Degradation
31m ago
West fabric throughput at 6.1 Gbps vs 9.8 Gbps capacity. Latency elevated to 34ms. Pattern consistent with upstream ISP congestion on AS4134.
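For the high-severity anomaly above, the artefact an operator actually reviews is a single recommendation object. An illustrative payload in Python; the prefix, LOCAL_PREF value, and confidence figure are invented for this example, not taken from the Fidelity event.

    recommendation = {
        "anomaly": "api-gateway/latency-spike-packet-loss",
        "root_cause": "upstream congestion on the primary transit path",
        "blast_radius": ["api-gateway", "database-proxy"],  # retry-storm cascade risk
        "action": {
            "type": "bgp_local_pref",
            "prefix": "203.0.113.0/24",  # hypothetical documentation prefix
            "local_pref": 150,           # prefer the secondary transit path
        },
        "confidence": 0.87,
        "operator_decision": None,       # set on accept/dismiss, then audited
    }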
Key PM Decisions
Platform Observability vs. Selective Service Monitoring
When scoping the L3-L7 observability programme at Fidelity, the initial proposal was to instrument only the 8 highest-criticality services; Network Engineering treated full coverage of all 40+ services as a future-state goal. As PM, I had to choose the v1 scope.
Decision
Full L3-L7 coverage from day one across all 40+ services, with severity-tiered alerting rather than selective instrumentation.
Rationale
Selective monitoring creates invisible failure modes. When an uninstrumented service cascades into an instrumented one, the root cause is invisible. At Fidelity specifically, we had experienced two SLA breaches in the previous year where the root cause was a service we were not monitoring. The Database Proxy retry storm pattern from API Gateway degradation is exactly the failure mode selective monitoring would have missed.
Outcome
18% YoY uptime improvement. Zero SLA breaches in 24 months of full-coverage observability. Root cause attribution time fell from 47 minutes to under 4 minutes on average.
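Severity-tiered alerting is what kept full coverage from drowning the NOC. A minimal sketch of the tiering idea, assuming invented tier names and latency budgets: every service is instrumented and every breach feeds the correlation engine, but only the top tier pages on-call.

    TIERS = {
        "tier-1": {"page_oncall": True,  "latency_budget_ms": 50},   # trading-critical
        "tier-2": {"page_oncall": False, "latency_budget_ms": 150},
        "tier-3": {"page_oncall": False, "latency_budget_ms": 400},
    }

    def handle_latency_breach(tier: str, latency_ms: float) -> str:
        # Every breach is recorded for cross-layer correlation, so cascades
        # from low-tier services stay visible; only tier-1 pages an engineer.
        if latency_ms <= TIERS[tier]["latency_budget_ms"]:
            return "ok"
        return "page" if TIERS[tier]["page_oncall"] else "record"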
Alert Routing vs. Recommended Action
The standard pattern for enterprise observability tools is alert-and-escalate: a threshold fires, a ticket is created, an engineer investigates. The question was whether to follow this pattern or invest in the recommendation layer.
Decision
AI recommendation layer surfaced alongside every alert — root cause, blast radius, specific routing action, confidence score.
Rationale
At Fidelity, the average time between alert fire and routing action was 45 minutes. The bottleneck was not detection; it was interpretation. Engineers received an alert, opened a dashboard, traced the dependency chain, formed a hypothesis, validated against BGP state, and only then acted. The AI layer compresses that 45-minute process into a 90-second review cycle. The design was inspired directly by the anomaly-detection and failback-validation flows I designed at Modguard for LLM traffic routing; the recommendation pattern is identical whether you are routing packets or LLM requests.
Outcome
Alert-to-action time fell from 45 minutes to under 4 minutes. On-call cognitive load dropped significantly: engineers validate and accept rather than investigating from scratch.
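The 90-second review cycle depends on accept/dismiss being a one-decision interaction. A sketch of that gate with hypothetical function names; in the demo above, accepting simulates policy propagation rather than pushing real router config.

    import json
    from datetime import datetime, timezone

    def review(rec: dict, operator: str, accept: bool) -> None:
        # Every operator decision is audited alongside the recommendation,
        # whether accepted or dismissed (step 5 of the shared pattern).
        entry = {
            "recommendation": rec,
            "operator": operator,
            "decision": "accept" if accept else "dismiss",
            "decided_at": datetime.now(timezone.utc).isoformat(),
        }
        print(json.dumps(entry))  # stand-in for the audit sink
        if accept:
            propagate_policy(rec["action"])

    def propagate_policy(action: dict) -> None:
        # Simulation stand-in; a real deployment would drive BGP config.
        print(f"propagating {action['type']} for {action.get('prefix')}")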
Measured Outcomes
18%
YoY uptime improvement across 40+ monitored services after full L3-L7 observability programme launch at Fidelity
Fidelity network telemetry, 2023 vs 2022
4 min
mean time from alert to routing action after AI recommendation layer — down from 45 minutes with manual investigation
Fidelity NOC incident log, 2024
0
SLA breaches in 24 months of full-coverage observability. Two breaches in the prior 12-month period with selective monitoring.
Fidelity SLA compliance records
2
Fortune 500 pilot deals attributed to the AI recommendation layer at Modguard, where the same pattern was applied to LLM traffic routing
Modguard commercial pipeline, 2024
Why This Matters for Cloudflare WAN
Cloudflare Analytics for Magic WAN Is This Problem at Internet Scale
Cloudflare's network sees every packet that crosses Magic WAN. The telemetry data exists. The dashboards exist. What Cloudflare does not yet fully have — and what every enterprise WAN customer will eventually demand — is a decision layer that converts telemetry into recommended routing actions with confidence scores. The WAN Copilot pattern I built for 40 services is the product thesis for what Cloudflare Analytics should become for Magic WAN at internet scale.
Two Observability Contexts: Fidelity (Platform) and Modguard (AI Native)
Most network PMs have the platform observability experience (Fidelity: L3-L7 telemetry, uptime tracking, alert management). Very few have built AI-native recommendation tooling on top of network telemetry. The Modguard experience — where AI recommendation patterns were designed for a SASE-analogous system — is the differentiator. Cloudflare is an AI-native company. A PM who already understands how to design AI recommendations for network operators is the right person to drive this product area.
40+ services. 45 minutes to 4 minutes. The observability layer that the network was missing.