Product Garaj


Fidelity Investments · Case Study 01

Designing Vault at the scale of 80,000.

A Vault interface that turns a security backbone into a workflow ~80,000 associates can use safely. It is designed around one failure mode: when the system is too hard to use, people route around it.

Client

Fidelity Investments

Role

Senior Product Designer

Surface

HashiCorp Vault UI

Scale

~80,000 associates

Domain

Identity & Secrets

Year

2023-2024

The deployment, in one picture

Identity flows from Fidelity's identity provider into Vault via LDAP-synced groups. Vault holds the policies, leases, and secrets — both static (DB credentials, cloud IAM, third-party APIs) and dynamic.
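As a rough illustration of the group-to-policy step, the sketch below unions the policies attached to each synced group a user belongs to. Only `wealth-mgmt-readonly` appears in this case study. The group names, the other policy names, and the `effective_policies` helper are hypothetical, and the real sync is configured inside Vault rather than in application code.

```python
# Hypothetical group-to-policy mapping; group names and most policy names
# are illustrative, not the real configuration.
GROUP_POLICIES = {
    "wm-developers": {"wealth-mgmt-readonly"},
    "wm-oncall": {"wealth-mgmt-readonly", "wealth-mgmt-incident"},
    "platform-eng": {"platform-admin"},
}


def effective_policies(ldap_groups):
    """Union the policies attached to every synced group the user belongs to."""
    policies = set()
    for group in ldap_groups:
        # Groups with no mapping grant nothing.
        policies |= GROUP_POLICIES.get(group, set())
    return sorted(policies)


print(effective_policies(["wm-developers", "wm-oncall", "book-club"]))
# ['wealth-mgmt-incident', 'wealth-mgmt-readonly']
```

The point of the sketch is that an associate never sees this union directly, which is why effective access is hard to reason about from the group list alone.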

The CLI and API serve the platform team. Roughly 79,500 associates downstream of every policy use the UI — that's the surface I designed.

// vault topology @ fidelity

Identity Provider

LDAP

Group sync → Vault policies

Source of truth

Vault

Policies · Leases

Static + dynamic secrets

OIDC SSO for Vault-backed apps

Consumers

Apps

DB credentials, cloud IAM, 3rd-party APIs

~80,000 associates downstream

01

80,000

associates downstream of every policy

scope · firmwide

02

LDAP

synced groups feed Vault policies

identity → vault

03

static + dynamic

DB creds, cloud IAM, 3rd-party APIs

secrets · mixed

04

79,500

use the UI — the surface I designed

surfaces · ui layer

The brief I argued for

The brief I was given, and the brief I argued for.

As briefed

"Build a UI for Vault. Self-service. Reduce platform team load."

A productivity ask. Framed as ticket-deflection, the work competes with every other UX investment for budget, and is measured in clicks rather than risk.

Funded as: platform productivity
Measured as: ticket volume, click-throughs

Reframed

When associates can't safely use Vault, they route around it.

Pasted secrets. Shadow credential stores. Audit gaps. The failure mode is a security system that becomes functionally insecure at scale because nobody can use it as intended.

Funded as: security risk control
Measured as: credential-handling hygiene

The reframe changed three things: who funded the work, what we measured success against, and which features got prioritized.

Three users, one Vault

Three different mental models of what Vault is. Designing for any one of them is straightforward; designing for all three on a shared surface is the actual problem.

Persona 01

SecOps Admin

Authors policy in HCL. Owns the rules.

What they do

Write and review path-and-capability policies
Approve high-blast-radius access grants
Audit who has what, and for how long

What scares them

Drift between authored and effective access — over-broad policies whose blast radius I can't see.

Mental model

Vault is a policy-authoring tool

Persona 02

Platform Engineer

Consumes policy. Wires apps to Vault.

What they do

Integrate services with the secrets engine
Manage lease lifecycle in production
Debug auth failures in deploy pipelines

What scares them

Lease expiry mid-deploy. Brittle auth flows that fail at 2 a.m. when nobody's watching.

Mental model

Vault is an API

Persona 03

Associate

e.g. wealth-management dev. Just needs access.

What they do

Request access to ship a feature
Use credentials inside their working session
Renew or release access when finished

What scares them

Not knowing why a request was denied — and not knowing how long my access actually lasts.

Mental model

Vault is the reason my session timed out

The shared-surface problem: the same policy object SecOps authors is what the platform engineer consumes — and what the associate is implicitly governed by. The work was making one primitive legible to three audiences at three different altitudes.

Research as a decision engine

Three decisions the research actually drove — what came in, what we chose, what changed downstream.

A · From interviews

"Associates" wasn't one persona. Infrequent requesters and embedded developers had different needs.

B · Decision

Segment the associate persona

Build two entry points instead of forcing one surface to serve both literacy levels.

C · What changed

Two surfaces shipped, not one

Simplified request flow for occasional users; inline path for embedded engineers.

A · From shadow sessions

A long-running batch job died silently. Lease expired mid-run. Failure went unnoticed for two days.

B · Decision

Make invisible failure the north star

Design for the case where state changes and nobody is watching.

C · What changed

Lease became a visible, escalating object

Surfaced renewal 15 minutes before expiry — because at expiry the deploy is already failing.
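A minimal sketch of how a lease could be classified so the UI escalates before expiry rather than at it. The 15-minute window comes from the decision above. The state names and the `lease_state` function are assumptions for illustration, not the shipped component.

```python
from datetime import datetime, timedelta, timezone

# 15 minutes comes from the case study; the state names are hypothetical.
RENEWAL_WINDOW = timedelta(minutes=15)


def lease_state(expire_time, now, renewable):
    """Classify a lease so the UI can escalate before it expires."""
    remaining = expire_time - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= RENEWAL_WINDOW:
        # Inside the window: prompt for renewal, or warn if it cannot be renewed.
        return "renew_now" if renewable else "expiring"
    return "active"


expiry = datetime(2024, 3, 14, 17, 18, 22, tzinfo=timezone.utc)
print(lease_state(expiry, expiry - timedelta(hours=2), True))     # active
print(lease_state(expiry, expiry - timedelta(minutes=10), True))  # renew_now
print(lease_state(expiry, expiry + timedelta(seconds=1), True))   # expired
```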

A · From telemetry

Self-service rate, broken out by segment. Nine months in, one feature wasn't moving the number.

B · Decision

Remove the feature

No defensive replacement. Carrying dead surface area is a maintenance and confusion cost.

C · What changed

Surface area reduced, focus regained

Removed in the next release. Engineering reclaimed budget for the renewal flow.

// methods

~30

contextual interviews

shadow sessions

service-blueprint workshops

9 mo

post-launch telemetry

Design principle

Make risk visible.

The system holds the signal. The user often doesn't see it.

→ The corollary

Verification should be low-cost.

The cost to confirm an action before taking it should be near zero. In security workflows, skipping verification is itself a failure mode worth designing against.

The tension we held

Every signal surfaced to a user is also a signal SecOps could act on. Compliance asked for an individualized access-activity dashboard. We declined.

Where we landed: visibility to the user about their own state. Aggregated and anonymized for org-level oversight. No individualized monitoring.

Two design moves

The RBAC-aware permission editor (for SecOps) and the lease as a living thing (for the associate). Both built on the same principle: keep the policy primitive legible, surface what the system already knows.

wealth-mgmt-readonly.hcl · v3 · unsaved

vault.internal.fidelity.com

Capability matrix · click row to inspect

Path · Read · List · Crt · Upd · Del · Sudo

secret/wm/clients/*

secret/wm/portfolios/*

secret/wm/transactions/+

database/creds/wm-ro

secret/wm/admin/*

// HCL preview · live
path "secret/wm/portfolios/*" {
  capabilities = ["read", "list"]
}
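The live preview can be thought of as a pure function from one matrix row to a policy block. The sketch below assumes the Crt / Upd / Del columns map to Vault's `create`, `update`, and `delete` capabilities. The `render_policy` helper is hypothetical and is not the editor's actual code.

```python
# Column order follows the capability matrix header in the mock.
CAPABILITY_ORDER = ["read", "list", "create", "update", "delete", "sudo"]


def render_policy(path, granted):
    """Render one matrix row as a Vault policy path block."""
    unknown = set(granted) - set(CAPABILITY_ORDER)
    if unknown:
        raise ValueError(f"unknown capabilities: {sorted(unknown)}")
    # Emit capabilities in matrix order so the preview is stable between edits.
    caps = ", ".join(f'"{c}"' for c in CAPABILITY_ORDER if c in granted)
    return f'path "{path}" {{\n  capabilities = [{caps}]\n}}'


print(render_policy("secret/wm/portfolios/*", {"list", "read"}))
```

Keeping the output order fixed means toggling a cell changes only the capability it touches, so the HCL diff stays easy to review.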

Blast radius preview

247

Grants 247 paths across 4 secret engines

Computed on save by walking the path tree. The CLI does not expose this aggregation.

! over-broad capability flagged
secret/wm/admin/* matches 1 path with admin-level access.
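One way to compute a blast radius like the one above is to match each policy pattern against an inventory of known paths. The sketch follows Vault's policy path rules, where a trailing `*` is a prefix glob and `+` matches exactly one path segment. The inventory, the function names, and the use of the first path segment as the "engine" are assumptions for illustration.

```python
import re


def pattern_to_regex(pattern):
    """Translate a Vault-style policy path into a compiled regex."""
    glob = pattern.endswith("*")
    body = pattern[:-1] if glob else pattern
    # "+" stands for exactly one segment; everything else is literal.
    parts = ["[^/]+" if seg == "+" else re.escape(seg) for seg in body.split("/")]
    return re.compile("/".join(parts) + (".*" if glob else ""))


def blast_radius(patterns, known_paths):
    """Count the concrete paths and engines a set of policy patterns grants."""
    regexes = [pattern_to_regex(p) for p in patterns]
    granted = sorted(p for p in known_paths if any(r.fullmatch(p) for r in regexes))
    # Assumption: the first path segment identifies the secrets engine mount.
    engines = sorted({p.split("/", 1)[0] for p in granted})
    return {"paths": len(granted), "engines": engines, "granted": granted}


# Illustrative inventory, not real data.
known = [
    "secret/wm/clients/acme",
    "secret/wm/clients/acme/contacts",
    "secret/wm/transactions/2024",
    "secret/wm/transactions/2024/q1",
    "database/creds/wm-ro",
    "secret/hr/payroll",
]
policy = ["secret/wm/clients/*", "secret/wm/transactions/+", "database/creds/wm-ro"]
result = blast_radius(policy, known)
print(result["paths"], result["engines"])  # 4 ['database', 'secret']
```

Note that `secret/wm/transactions/+` does not reach `secret/wm/transactions/2024/q1`, since `+` covers a single segment only. A preview like this makes that distinction visible at authoring time.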

Audit log · live tail

14:22 · EDIT · capability matrix

14:18 · VIEW · blast radius preview

14:11 · REQ · wealth/db/positions

13:54 · WARN · over-broad capability flagged

Lease lifecycle · persona: associate · click step

1. Request

2. Approval

3. Active

4. Renewal

// approver view, full context

SC

Samantha Chen

Wealth Mgmt · requested 2m ago

awaiting

"Investigating P2 incident on positions feed — need read-only on staging-mirrored prod data."

Recent grants on this path

s.chen → wealth/db/positions · 4h · expired

s.chen → wealth/db/* · 8h · approved

s.chen → wealth/db/clients · 1h · expired

Same requester, same path family. Pattern: targeted reads, none extended past TTL.

Engineering collaboration

Where UX changed the API.

response.json · before
{
  "id": "db/creds/wealth/a3f1...",
  "expire_time": "2024-03-14T17:18:22Z",
  "renewable": true,
  "ttl": 28800
}
// renewable: yes/no - until when?
// computable client-side, but logic
// drifted across UI / CLI / alerting.

response.json · after
{
  "id": "db/creds/wealth/a3f1...",
  "expire_time": "2024-03-14T17:18:22Z",
  "renewable": true,
+ "renewable_until": "2024-03-14T21:18:22Z",
+ "max_ttl": 43200,
  "ttl": 28800
}
// renewable_until is now first-class.
// One source of truth, three consumers.
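For context, this is roughly the arithmetic each consumer had to repeat before the field existed. The sketch assumes `ttl` is the full lease duration counted back from `expire_time`, which is consistent with the sample values shown, and that the client can obtain `max_ttl` from somewhere. `derive_renewable_until` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def derive_renewable_until(lease):
    """Recompute renewable_until from expire_time, ttl, and max_ttl.

    Assumes ttl is the full lease duration, so issue time = expire_time - ttl.
    """
    expire = datetime.fromisoformat(lease["expire_time"].replace("Z", "+00:00"))
    issued = expire - timedelta(seconds=lease["ttl"])
    until = issued + timedelta(seconds=lease["max_ttl"])
    return until.strftime("%Y-%m-%dT%H:%M:%SZ")


lease = {
    "expire_time": "2024-03-14T17:18:22Z",
    "renewable": True,
    "max_ttl": 43200,
    "ttl": 28800,
}
print(derive_renewable_until(lease))  # 2024-03-14T21:18:22Z
```

Three separate copies of this logic are three chances to get the assumption about `ttl` wrong, which is the argument for returning the value from the server instead.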

PR #2841 · merged · leases: expose renewable_until in lookup response · +47 / -2 · 3 commits

Vault UI · UI

access portal

Renewal countdown component

credstool · CLI

ops CLI

Internal credential admin CLI

leasewatch · Ops

alerting

Expiry & renewal pipeline

When a design constraint is really a data-model gap, the right fix is often to change the data rather than work around it in the UI.

What the work actually moved

Handoff time

30%↓

Reduction in design-to-dev handoff. Component library shipped with shared primitives.

Request time

3 days→4h

Median access request time. From Jira ticket to live lease, across all five business domains.

Self-service rate

0%→71%

Associate self-service rate. Requests resolved without platform team intervention.

// the lesson I'm taking with me

Security UX is fundamentally an information-design problem — making state, lifecycle, and risk visible at the right time, in the right amount, to the right person.

// platform thesis

The patterns generalize.

Blast radius preview, lease as countdown, legibility-over-abstraction — none of it is Vault-specific. It's the approach for any product where users take actions whose consequences they can't fully see.

Terraform plan · Boundary session · Consul service intent

Designed by Rebecka Raj · Fidelity Investments · 2023-2024

I've been doing this work on a surface that sits on top of Vault. The next step is to do it inside Vault, and across the products that sit alongside it.
