Rebecka Raj
PM Portfolio - Project 04
AI-Native WAN
Observability
How I closed the 45-minute gap between alert and routing action at Fidelity: building L3-L7 observability for 40+ services, then adding an AI recommendation layer that applies the same anomaly-detection pattern I designed for LLM traffic routing at Modguard.
Context
Fidelity + Modguard
Role
Senior PM, Network Infra
Services Monitored
40+
Uptime Improvement
18% YoY
Domain
WAN Observability / AI
Period
2022 - 2024
The Observability Gap
At Fidelity in 2022, we had dashboards. What we did not have was decisions. The NOC team received alerts from 40+ services, opened Grafana, traced the dependency chain manually, formed a routing hypothesis, cross-referenced BGP state in a separate CLI session, validated the fix, and then acted. Average time from alert to routing action: 45 minutes.
In a trading environment, 45 minutes of degraded performance on a critical service is not a monitoring problem — it is a business continuity problem. The root issue was not lack of data. It was the absence of a layer between telemetry and action that could compress interpretation time from 45 minutes to 90 seconds.
45 minutes
alert-to-routing-action (before)
4 minutes
after WAN Copilot recommendation layer
L3 - Routing Layer
BGP session state, SD-WAN path health, MPLS circuit utilization. Route-level anomalies that precede application-layer symptoms by 8-15 minutes — the early warning layer.
4 services monitored
L4 - Transport Layer
TCP connection health, load balancer throughput, packet loss across WAN links. The cascade layer: L3 anomalies manifest here before their impact reaches L7.
5 services monitored
L7 - Application Layer
API gateway latency, auth service response times, SaaS connector health. Customer-visible impact layer. By the time L7 degrades, L3 and L4 have been signalling for minutes.
3 services monitored
AI Recommendation Layer
Cross-layer correlation, root cause attribution, blast radius calculation, BGP policy recommendation with confidence scoring. The decision layer that was missing.
Pattern from Modguard applied
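That 8-15 minute lead time between layers is what makes an AI layer tractable: a lower-layer anomaly that fires inside the lead window before a higher-layer symptom is a root-cause candidate. A minimal sketch of the correlation idea in Python; the field names and the 15-minute window are illustrative, not the production schema.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Anomaly:
        layer: str         # "L3", "L4", or "L7"
        service: str       # e.g. "api-gateway"
        signal: str        # e.g. "bgp_med_change", "packet_loss", "latency_spike"
        fired_at: datetime

    LAYER_ORDER = {"L3": 0, "L4": 1, "L7": 2}

    def root_cause_candidates(symptom: Anomaly, history: list[Anomaly],
                              lead: timedelta = timedelta(minutes=15)) -> list[Anomaly]:
        # A lower-layer anomaly that fired within the lead window before
        # the symptom is a candidate root cause, returned earliest-first.
        hits = [a for a in history
                if LAYER_ORDER[a.layer] < LAYER_ORDER[symptom.layer]
                and timedelta(0) <= symptom.fired_at - a.fired_at <= lead]
        return sorted(hits, key=lambda a: a.fired_at)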
The Pattern from Modguard
At Modguard, I shaped an enterprise AI traffic platform, a SASE analogue for LLM workloads: multi-tenant routing with allow/deny rules, rate limiting, audit logging, anomaly detection, and failback validation. When I returned to the WAN observability problem at Fidelity, I recognized that the recommendation layer was structurally identical. Whether you are routing packets or LLM requests, the decision-support pattern is the same.
Modguard Pattern (LLM Traffic)
1. Detect anomalous LLM request pattern (latency spike, error rate)
2. Identify root cause: model degradation, rate limit, upstream failure
3. Calculate blast radius: which tenants affected, which workflows impacted
4. Recommend failback: route to backup model endpoint with confidence score
5. Audit log every recommendation and operator decision

WAN Copilot (Network Traffic)
1. Detect anomalous network telemetry (latency spike, packet loss, BGP MED change)
2. Identify root cause: ISP congestion, BGP path degradation, cascade from upstream service
3. Calculate blast radius: which services affected, which user populations impacted
4. Recommend routing action: BGP LOCAL_PREF adjustment with specific prefix and value
5. Audit every recommendation and operator accept/dismiss decision
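The structural identity is easy to make concrete. A hedged sketch of the shared skeleton in Python, with invented type and function names; each domain plugs its own detect, attribute, and recommend logic into the same five steps.

    from dataclasses import dataclass
    from typing import Any, Callable, Optional

    @dataclass
    class Recommendation:
        root_cause: str
        blast_radius: list[str]  # affected tenants (Modguard) or services (WAN Copilot)
        action: str              # e.g. a failback endpoint or a BGP LOCAL_PREF change
        confidence: float        # surfaced so the operator can judge the call quickly

    def run_pipeline(telemetry: Any,
                     detect: Callable[[Any], Optional[Any]],
                     attribute: Callable[[Any], str],
                     radius: Callable[[Any], list[str]],
                     recommend: Callable[[Any], tuple[str, float]],
                     audit: Callable[[Recommendation], None]) -> Optional[Recommendation]:
        # Steps 1-5: detect, attribute the root cause, size the blast
        # radius, recommend with confidence, audit before the operator acts.
        anomaly = detect(telemetry)
        if anomaly is None:
            return None
        action, confidence = recommend(anomaly)
        rec = Recommendation(attribute(anomaly), radius(anomaly), action, confidence)
        audit(rec)
        return rec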
The PM Insight
The columns are different. The structure is identical. A PM who has built decision-support tooling for AI traffic routing at Modguard has already solved the hardest design problem in network observability: how do you surface a recommendation with enough reasoning transparency that a human expert will trust and act on it in under 90 seconds?
WAN Copilot - Live Demo
Three active anomalies from a real degradation event at Fidelity (anonymised). Click into any anomaly to see the full root cause chain, blast radius, and BGP recommendation. Accept to simulate policy propagation.
WAN Copilot
Fidelity Enterprise WAN - 40+ services monitored
3 anomalies - Live
Anomaly Detection / Live Telemetry / SLA Risk
WAN Copilot detected 3 anomalies in the last 35 minutes. Root causes, blast radius, and routing recommendations are ready for review. Accept or dismiss each recommendation below.
High - API Gateway - Latency Spike + Packet Loss
4m ago
API Gateway latency has risen from baseline 22ms to 145ms over 18 minutes. Packet loss elevated at 1.2%. SLA breach projected within 11 days at current trajectory.
Medium - Database Proxy - Sustained Latency Elevation
22m ago
Database Proxy latency at 71ms vs 12ms baseline. Packet loss 0.8%. Uptime trending toward SLA threshold at current rate.
Low - SD-WAN Fabric West - Throughput Degradation
31m ago
West fabric throughput at 6.1 Gbps vs 9.8 Gbps capacity. Latency elevated to 34ms. Pattern consistent with upstream ISP congestion on AS4134.
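For the high-severity anomaly above, the artefact an operator actually reviews is a single recommendation object. An illustrative payload in Python; the prefix, LOCAL_PREF value, and confidence figure are invented for this example, not taken from the Fidelity event.

    recommendation = {
        "anomaly": "api-gateway/latency-spike-packet-loss",
        "root_cause": "upstream congestion on the primary transit path",
        "blast_radius": ["api-gateway", "database-proxy"],  # retry-storm cascade risk
        "action": {
            "type": "bgp_local_pref",
            "prefix": "203.0.113.0/24",  # hypothetical documentation prefix
            "local_pref": 150,           # prefer the secondary transit path
        },
        "confidence": 0.87,
        "operator_decision": None,       # set on accept/dismiss, then audited
    }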
Key PM Decisions
Platform Observability vs. Selective Service Monitoring
When scoping the L3-L7 observability programme at Fidelity, the initial proposal was to instrument only the 8 highest-criticality services; Network Engineering treated full coverage of all 40+ services as a future-state goal. As PM, I had to choose the v1 scope.
Decision
Full L3-L7 coverage from day one across all 40+ services, with severity-tiered alerting rather than selective instrumentation.
Rationale
Selective monitoring creates invisible failure modes. When an uninstrumented service cascades into an instrumented one, the root cause is invisible. At Fidelity specifically, we had experienced two SLA breaches in the previous year where the root cause was a service we were not monitoring. The Database Proxy retry storm pattern from API Gateway degradation is exactly the failure mode selective monitoring would have missed.
Outcome
18% YoY uptime improvement. Zero SLA breaches in 24 months of full-coverage observability. Root cause attribution time fell from 47 minutes to under 4 minutes on average.
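Severity-tiered alerting is what kept full coverage from drowning the NOC. A minimal sketch of the tiering idea, assuming invented tier names and latency budgets: every service is instrumented and every breach feeds the correlation engine, but only the top tier pages on-call.

    TIERS = {
        "tier-1": {"page_oncall": True,  "latency_budget_ms": 50},   # trading-critical
        "tier-2": {"page_oncall": False, "latency_budget_ms": 150},
        "tier-3": {"page_oncall": False, "latency_budget_ms": 400},
    }

    def handle_latency_breach(tier: str, latency_ms: float) -> str:
        # Every breach is recorded for cross-layer correlation, so cascades
        # from low-tier services stay visible; only tier-1 pages an engineer.
        if latency_ms <= TIERS[tier]["latency_budget_ms"]:
            return "ok"
        return "page" if TIERS[tier]["page_oncall"] else "record"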
Alert Routing vs. Recommended Action
The standard pattern for enterprise observability tools is alert-and-escalate: a threshold fires, a ticket is created, an engineer investigates. The question was whether to follow this pattern or invest in the recommendation layer.
Decision
AI recommendation layer surfaced alongside every alert — root cause, blast radius, specific routing action, confidence score.
Rationale
At Fidelity, the average time between alert fire and routing action was 45 minutes. The bottleneck was not detection; it was interpretation. Engineers received an alert, opened a dashboard, traced the dependency chain, formed a hypothesis, validated against BGP state, and only then acted. The AI layer compresses that 45-minute process into a 90-second review cycle. The design was inspired directly by the anomaly-detection and failback-validation flows I designed at Modguard for LLM traffic routing; the recommendation pattern is identical whether you are routing packets or LLM requests.
Outcome
Alert-to-action time fell from 45 minutes to under 4 minutes. On-call cognitive load dropped significantly: engineers validate and accept rather than investigating from scratch.
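The 90-second review cycle depends on accept/dismiss being a one-decision interaction. A sketch of that gate with hypothetical function names; in the demo above, accepting simulates policy propagation rather than pushing real router config.

    import json
    from datetime import datetime, timezone

    def review(rec: dict, operator: str, accept: bool) -> None:
        # Every operator decision is audited alongside the recommendation,
        # whether accepted or dismissed (step 5 of the shared pattern).
        entry = {
            "recommendation": rec,
            "operator": operator,
            "decision": "accept" if accept else "dismiss",
            "decided_at": datetime.now(timezone.utc).isoformat(),
        }
        print(json.dumps(entry))  # stand-in for the audit sink
        if accept:
            propagate_policy(rec["action"])

    def propagate_policy(action: dict) -> None:
        # Simulation stand-in; a real deployment would drive BGP config.
        print(f"propagating {action['type']} for {action.get('prefix')}")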
Measured Outcomes
18%
YoY uptime improvement across 40+ monitored services after full L3-L7 observability programme launch at Fidelity
Fidelity network telemetry, 2023 vs 2022
4 min
mean time from alert to routing action after AI recommendation layer — down from 45 minutes with manual investigation
Fidelity NOC incident log, 2024
0
SLA breaches in 24 months of full-coverage observability. Two breaches in the prior 12-month period with selective monitoring.
Fidelity SLA compliance records
2
Fortune 500 pilot deals attributed to the AI recommendation layer at Modguard, where the same pattern was applied to LLM traffic routing
Modguard commercial pipeline, 2024
Why This Matters for Cloudflare WAN
Cloudflare Analytics for Magic WAN Is This Problem at Internet Scale
Cloudflare's network sees every packet that crosses Magic WAN. The telemetry data exists. The dashboards exist. What Cloudflare does not yet fully have — and what every enterprise WAN customer will eventually demand — is a decision layer that converts telemetry into recommended routing actions with confidence scores. The WAN Copilot pattern I built for 40 services is the product thesis for what Cloudflare Analytics should become for Magic WAN at internet scale.
Two Observability Contexts: Fidelity (Platform) and Modguard (AI Native)
Most network PMs have the platform observability experience (Fidelity: L3-L7 telemetry, uptime tracking, alert management). Very few have built AI-native recommendation tooling on top of network telemetry. The Modguard experience — where AI recommendation patterns were designed for a SASE-analogous system — is the differentiator. Cloudflare is an AI-native company. A PM who already understands how to design AI recommendations for network operators is the right person to drive this product area.
40+ services. 45 minutes to 4 minutes. The observability layer that the network was missing.