LLM Report Verification
A verification platform for LLM-generated reports giving enterprise teams confidence scoring, hallucination detection, and structured review workflows so AI-drafted research can be trusted before it ships.
Client
Modguard.ai
Role
Lead Product Designer
Timeline
12 weeks
Platform
Enterprise SaaS
Domain
AI/ML Trust & Safety
Year
2024
The Challenge
AI-drafted reports look authoritative regardless of accuracy. A hallucinated statistic is indistinguishable from a verified fact, and in regulated industries like healthcare and defense, a single unverified claim can trigger compliance violations or security breaches.
I led the design of Modguard’s verification platform: every claim individually scored, traceable to its source, and reviewable through a structured approval workflow. The design was developed through 98 user interviews and validated with enterprise customers.
Analyst Review Workspace
Walk through the verification experience from the analyst’s perspective: accepting reports, drilling into flagged claims, resolving issues with source evidence, and signing off.
Meridian Corp Q3 Investment Memo
1. Report Queue
2. Open Report
3. Resolve Hallucination
4. Research & Sources
5. Resolve Remaining
6. Mark as Reviewed

Step 1 · Report Generation
Analyst requests a research memo on Meridian Corp’s Q3 earnings. The LLM generates a 2,400-word draft with inline confidence scoring on every claim.
Reports Awaiting Review
Meridian Corp Q3 Investment Memo
2 min ago · 4 alerts
Accept & Review
Healthcare Sector 2025 Outlook
18 min ago · 9 alerts
EM Fixed Income Analysis
1 hr ago · 2 alerts
Measured Impact
62%
reduction in report review time with claim-level confidence scoring
A/B test, n=40 enterprise users
34%
improvement in hallucination catch rate before reports reach clients
Validation study, 12-week period
91%
of reviewers said visible AI reasoning increased their confidence in approving reports
User research interviews, n=98
3.2x
faster source verification with inline citation panels vs. traditional footnotes
Task completion analysis, n=32
Research & Discovery
Enterprise User Personas
Elena · The Compliance Lead
12 years in financial compliance · Healthcare enterprise
Needs absolute traceability. Every LLM report must have an audit trail. Zero tolerance for unverified claims reaching clients.
Core Pain Point
Spends 4+ hours daily manually checking AI reports against sources
Design Goal
Automated verification she can trust and defend to auditors
Marcus · The Research Analyst
6 years in equity research · Mid-cap investment firm
Power user drafting 10+ reports per week with LLMs. Burned by a hallucinated earnings figure that reached a client deck.
Core Pain Point
Lost trust after a high-profile error; now manually verifies everything
Design Goal
Clear confidence signals he can act on quickly
Key Design Decisions
Claim-Level vs. Document-Level Confidence
A single confidence score for the entire document. Users had no way to identify which specific claims were reliable. A document could score 85% while containing a fabricated statistic.
Before
Research Memo Q3 Analysis
87% Confident
No way to identify which claims are unreliable
Every sentence individually scored across four tiers: Verified (≥90%), Sourced (75–89%), Inferred (60–74%), Speculative (<60%). Low-confidence claims auto-expand their source panel.
After
98% Verified · 89% Sourced · 72% Inferred · 54% Speculative
Measured Impact
Report review time reduced by 62%; reviewers focus on flagged claims instead of re-reading entire documents
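To make the tiering concrete: a minimal sketch of the threshold mapping, assuming the bands above. The names are hypothetical, not Modguard’s actual API.

```ts
// Hypothetical sketch of the four-tier mapping; thresholds from the design above.
type ConfidenceTier = "Verified" | "Sourced" | "Inferred" | "Speculative";

// Map a claim's confidence score (0-100) to its display tier.
function tierFor(score: number): ConfidenceTier {
  if (score >= 90) return "Verified";
  if (score >= 75) return "Sourced";
  if (score >= 60) return "Inferred";
  return "Speculative";
}

// Low-confidence claims auto-expand their source panel.
function shouldAutoExpand(score: number): boolean {
  const tier = tierFor(score);
  return tier === "Inferred" || tier === "Speculative";
}
```

Scoring at the claim level rather than the document level is what lets reviewers jump straight to Inferred and Speculative claims instead of re-reading everything.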
Passive Footnotes vs. Active Reasoning Chains
Citations appeared as numbered footnotes at the bottom of the document. Verifying a claim required scrolling to the footnote, then navigating to the source: a four-step process most users skipped.
Before
[1]
FOOTNOTES
[1] SEC Filing, Q3 2024, pg 23...
4 steps to verify; most users skip
Inline citation chips expand to reveal step-by-step reasoning chains: what the AI retrieved, how it extracted data, what it cross-referenced, and why it assigned a confidence level.
After
2 sources
REASONING CHAIN
1. Retrieved SEC filing
2. Extracted revenue data
3. Cross-referenced transcript
4. Confidence: 98%

Measured Impact
Source verification speed improved 3.2x, from an average of 48 seconds to 15 seconds per claim
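A sketch of the data a citation chip could carry, assuming the four chain steps shown above; every field and ID here is illustrative, not the product’s real schema.

```ts
// Illustrative shape for an inline citation chip and its reasoning chain.
interface ReasoningStep {
  kind: "retrieval" | "extraction" | "cross-reference";
  description: string;   // e.g. "Retrieved SEC filing"
  sourceId?: string;     // link back to the underlying document, when one exists
}

interface CitationChip {
  claimId: string;
  sources: string[];       // document IDs backing the claim ("2 sources")
  chain: ReasoningStep[];  // rendered step by step when the chip expands
  confidence: number;      // 0-100, shown as the final step of the chain
}

// The Meridian example from the mockup above (IDs invented for illustration).
const chip: CitationChip = {
  claimId: "claim-042",
  sources: ["sec-10q-q3-2024", "earnings-call-q3-2024"],
  chain: [
    { kind: "retrieval", description: "Retrieved SEC filing", sourceId: "sec-10q-q3-2024" },
    { kind: "extraction", description: "Extracted revenue data" },
    { kind: "cross-reference", description: "Cross-referenced transcript", sourceId: "earnings-call-q3-2024" },
  ],
  confidence: 98,
};
```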
Post-Hoc Review vs. Real-Time Hallucination Alerts
Hallucinations were caught during manual review, if they were caught at all. 12% of LLM reports contained at least one factual error that reached the final draft.
Before
Error found during manual review
...or by the client
Real-time hallucination detection flags inaccuracies inline. Each flag shows the original claim, the issue, the source conflict, and a one-click correction.
After
HALLUCINATION DETECTED
Claimed: $500B → Actual: $487B
Fix
Remove
Caught before it reaches anyone
Measured Impact
Unverified claims in published reports dropped from 12% to 0.8%
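A sketch of what a hallucination flag could look like as data, based only on the UI described above; the severity taxonomy and field names are assumptions.

```ts
// Assumed payload for an inline hallucination flag.
interface HallucinationFlag {
  claimId: string;
  severity: "high" | "medium" | "low";
  claimedText: string;      // e.g. "approached $500 billion"
  sourceConflict: string;   // why no source supports the claim
  suggestedFix?: string;    // e.g. "approached $490 billion"
}

// One-click resolution: apply the suggested fix, or remove the claim entirely.
function resolve(flag: HallucinationFlag, action: "fix" | "remove"): string | null {
  if (action === "fix" && flag.suggestedFix) return flag.suggestedFix;
  return null; // "remove" drops the claim from the draft
}
```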
Explore the Verification Platform
5 reports · Live
Overview
Unverifiable Claims
Hallucinations
The Human Layer
Total Reports
5
Avg Verifiability
83%
Avg Hallucination
8%
Edits Complete
3/5
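The overview numbers are plain means over the scorecard rows below; a sketch of the roll-up, with a hypothetical row shape:

```ts
// Hypothetical per-report row; mirrors the scorecard columns below.
interface ScorecardRow {
  verifiability: number;   // 0-100
  hallucination: number;   // 0-100
  editComplete: boolean;
}

function overview(rows: ScorecardRow[]) {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return {
    totalReports: rows.length,
    avgVerifiability: Math.round(mean(rows.map((r) => r.verifiability))), // 82.8 -> 83% for the five reports
    avgHallucination: Math.round(mean(rows.map((r) => r.hallucination))), // 7.6 -> 8%
    editsComplete: rows.filter((r) => r.editComplete).length,             // 3 of 5
  };
}
```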
Report Scorecard
Report | Verifiability | Hallucinations | Rating | Edit Complete
Meridian Corp Q3 Investment Memo (Research Note · GPT-4o · 2,400 words) | 87% | 4% | 4.0/5 | Yes
Healthcare Sector 2025 Outlook (Market Analysis · Claude 3.5 · 3,800 words) | 74% | 12% | 3.1/5 | No
Client Brief Portfolio Rebalancing (Client Memo · GPT-4o · 850 words) | 94% | 1% | 4.8/5 | Yes
ESG Compliance Tech Sector (Compliance · Claude 3.5 · 4,200 words) | 68% | 18% | 2.3/5 | No
EM Fixed Income Analysis (Research Note · GPT-4o · 1,600 words) | 91% | 3% | 4.2/5 | Yes

Complete Verification Flow
From AI-generated draft to approved memo: six steps that take an LLM report from raw output to a verified, auditable deliverable.
Step 1 · Report Generation · AI System · ~8 seconds
Analyst requests a research memo on Meridian Corp’s Q3 earnings. The LLM generates a 2,400-word draft with inline confidence scoring on every claim.
The system processes the earnings transcript, 10-Q filing, and three analyst reports. Each claim in the report is scored as it’s generated.
Step 2 · Automated Compliance Scan · AI System · ~3 seconds
Before the report reaches the reviewer, an automated compliance scan checks for regulatory language, unverified claims, and potential misquotations.
The scan identifies 1 high-severity hallucination (fabricated market cap figure) and 3 unverifiable claims in the draft report.
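A sketch of how scan findings might be modeled, assuming high-severity findings gate the report before a reviewer opens it; the types are illustrative, not the real pipeline.

```ts
// Assumed shape for one finding from the automated compliance scan.
interface ScanFinding {
  type: "hallucination" | "unverifiable-claim" | "regulatory-language" | "misquotation";
  severity: "high" | "medium" | "low";
  claimId: string;
  detail: string;   // e.g. "fabricated market cap figure"
}

// For the Meridian draft: 1 high-severity hallucination + 3 unverifiable claims.
function blocksAutoRelease(findings: ScanFinding[]): boolean {
  return findings.some((f) => f.severity === "high");
}
```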
Step 3 · Reviewer Triage · Human Reviewer · ~30 seconds
Sarah Chen opens the Overview tab. The memo appears with its scores: 87% verifiability, 4% hallucination rate, 3 unverifiable claims to address.
She immediately sees the severity breakdown and prioritizes the hallucination flag before moving to unverifiable claims.
Step 4 · Claim-Level Drilldown · Human Reviewer · ~45 seconds
Sarah switches to the Unverifiable Claims tab and selects the Meridian memo. She reviews 3 flagged sentences the verification model couldn’t confirm.
Each claim shows the source reference, why verification failed, and the confidence score. She accepts two and adds a note to the third.
Step 5 · Hallucination Resolution · Human Reviewer · ~2 minutes
Sarah reviews the hallucination tab. One high-severity flag: the LLM fabricated a $500B market cap figure that doesn’t appear in any source document.
She accepts the suggested fix (‘approached $490 billion’) and the hallucination rate drops from 4% to 0%. The report’s verifiability score updates in real time.
Step 6 · Final Sign-Off · Human Reviewer · ~15 seconds
With all flags resolved, Sarah marks the report as edit-complete in The Human Layer tab. The audit trail captures every decision.
The report is now marked as reviewed with a 4.0/5 rating, visible to the full team in the Overview.
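The six steps imply a simple review lifecycle with an append-only audit trail; a sketch of the states and events, all names assumed rather than taken from the product:

```ts
// Assumed lifecycle states, mapped to the six steps above.
type ReportState =
  | "generated"       // step 1: LLM draft with claim-level scores
  | "scanned"         // step 2: automated compliance scan complete
  | "in-review"       // steps 3-5: triage, drilldown, resolution
  | "edit-complete";  // step 6: signed off, visible to the team

const nextState: Record<ReportState, ReportState | null> = {
  generated: "scanned",
  scanned: "in-review",
  "in-review": "edit-complete",
  "edit-complete": null, // terminal; only the audit trail grows after this
};

// Every human or AI decision lands in the audit trail.
interface AuditEvent {
  actor: "ai-system" | string;  // reviewer ID for human steps
  action: string;               // e.g. "accepted-fix", "marked-reviewed"
  claimId?: string;
  timestamp: string;            // ISO 8601
}
```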
Designed by Rebecka Raj, 2024
Patterns developed through 98 user research interviews with enterprise teams generating LLM reports in healthcare and defense. Validated with A/B testing across 40 enterprise reviewers over 12 weeks.