Designing the trust layer for an AI customer support copilot. Why agent acceptance is a design problem, not a model problem.
In shipped deployments, AI customer support copilots have helped agents close 31% more conversations per day. Most teams never see that lift. The bottleneck is not model quality. It is whether agents trust what the AI suggests enough to act on it. This study explores the design of that trust layer for the three internal roles that use, supervise, and train these systems.
Format
Design Study
Sector
B2B SaaS, AI Customer Support
Roles addressed
Support Agent, Supervisor, CX Admin
Method
Industry research, competitive analysis, design
Year
2026
Status
2026
The numbers behind this problem.
95%
Customer interactions AI-powered by 2025
Servion Global Solutions, 2024
56 to 59%
Service agents at risk of burnout
Salesforce State of Service, 2025
47%
Enterprise AI users who made decisions based on hallucinated content in 2024
Edelman Trust Barometer and enterprise AI usage surveys, 2024
01
AI copilots are everywhere. Trust in them is not.
The AI customer service market is projected to reach $47.82 billion by 2030, with 95% of customer interactions expected to be AI-powered by 2025. In shipped deployments, the productivity numbers are real. Intercom reported that agents at Lightspeed close 31% more customer conversations daily with their Copilot. Nielsen Norman Group found that support agents using AI tools handle 13.8% more inquiries per hour. ServiceNow reported a 52% reduction in time on complex cases. The capability is no longer the question.
Trust is. Edelman's 2024 Trust Barometer showed that only 25% of US adults trust AI to provide accurate information, and global trust in AI companies dropped from 61% to 53% over five years. On the customer side, the picture is no better. In Gartner's 2024 customer survey, 64% of customers said they would prefer companies did not use AI for service, and 53% would consider switching to a competitor if they learned a company uses AI for service. Inside enterprises, 47% of AI users admitted making at least one major decision based on hallucinated content in 2024.
For the agents on the front line, the dynamics are sharper still. Salesforce found that 56% of customer service agents report burnout, 77% say their workload has increased over the past year, and 69% of decision-makers say agent attrition is a major operational challenge. The product question is not whether AI can help. It is whether agents trust it enough to act on it, and whether the design of that trust holds up at scale.
02
Three challenges this study set out to solve.
Challenge 01
Confidence without overconfidence.
The most dangerous AI output is not a wrong answer. It is a confidently wrong answer. PlantNet shows '92% match: Japanese Maple.' That one number transforms blind trust into informed judgment. Most copilots in customer support today either show no confidence signal, or show one so opaque the agent learns to ignore it. The design challenge is calibrating what users see to what the system actually knows.
Challenge 02
Provenance is the product, not metadata.
An AI suggestion without a source is a guess. The agent has to verify it anyway, which costs more time than writing the response themselves. Intercom's Copilot ships with inline source citations because that is what the workflow needs, not because it looks impressive. The design challenge is making the source as legible as the answer, without making the interface noisy.
Challenge 03
Failure is a designed surface.
What happens when the AI is wrong is often more important than what happens when it is right. Agents stop trusting copilots after a few high-confidence wrong answers, and that trust does not come back easily. The design challenge is treating low-confidence states, ambiguous queries, and edge cases as first-class screens, not as fallbacks.
03
What the research said. Before any screen was drawn.
This study began with three weeks of desk research. I read public industry reports on AI in customer support: Salesforce State of Service, Zendesk CX Trends, the McKinsey 2025 AI Adoption Survey, Servion's market forecasts, the NBER paper on generative AI productivity in customer support, Nielsen Norman Group's research, and the Edelman Trust Barometer on AI. I studied the design of every shipped agent assist product I could access: Intercom Fin and Copilot, Zendesk Agent Copilot, Microsoft Service Agent in Microsoft 365 Copilot, NiCE Copilot for Agents, Assembled Agent Copilot, Yuma AI, Typewise, Talkative AI Copilot, Parloa, and Minerva CQ. I read failure case studies. The patterns are clear once you look at enough of them.
Source
NBER, Generative AI at Work, 2023
Finding
Support agents using generative AI saw a 14% productivity boost on average, with the largest gains among less-experienced agents
Implication for design
Onboarding-grade help is more valuable than expert-grade help. Design for the new hire first.
Source
Intercom Lightspeed case study, 2025
Finding
Agents using Copilot closed 31% more conversations daily versus the control group
Implication for design
Speed of acceptance matters as much as accuracy. Design the suggestion to be editable, not just acceptable.
Source
Salesforce State of Service, 2025
Finding
74% of agents say AI copilots help them feel more confident on complex cases
Implication for design
Confidence is a felt property, not just a number. Design contributes to it directly.
Source
Edelman Trust Barometer, 2024
Finding
Only 25% of US adults trust AI for accurate information. Global trust in AI companies dropped 8 points, from 61% to 53%, over five years.
Implication for design
Default user state is suspicion. Trust is earned by visible humility, not asserted by visual polish.
Source
Gartner 2024 customer survey
Finding
64% of customers prefer companies did not use AI for service. 53% would consider switching if they learned a company did.
Implication for design
Customer-facing disclosure of AI involvement is itself a design decision with revenue impact.
Source
Enterprise AI usage surveys, 2024
Finding
47% of enterprise AI users made at least one major decision based on hallucinated content
Implication for design
Hallucination is not a model problem to wait out. It is a UX problem to design around.
Source
Salesforce State of Service, 2025
Finding
56% of service agents report burnout. 77% report increased workload. 59% are at risk of work-related burnout.
Implication for design
Tooling decisions are retention decisions. The design has to lower cognitive load, not add a new layer of it.
Source
Grammarly Business and CX productivity research
Finding
Customer-facing teams spend 66% of the workweek in real-time communication, 17% above the average knowledge worker
Implication for design
Time-to-action matters more than time-to-answer. Inline beats sidebar.
Source
Yuma AI Glossier case, 2024
Finding
91% accuracy on shipping status tickets from initial deployment, sustained over months
Implication for design
Narrow scope plus governance plus validation beats broad scope plus model quality, every time.
Source
Typewise, AI suggestion acceptance rate as KPI, 2025
Finding
AI suggestion acceptance rate is the leading indicator of real-world AI value, ahead of raw accuracy
Implication for design
Track acceptance. Tag rejections. Feed both back into training. Design must support this loop.
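The acceptance loop in that last row can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Typewise's or any other vendor's implementation: the `SuggestionOutcome` record, the outcome labels (`sent_as_is`, `edited`, `rejected`), and the rejection tag are hypothetical names, and counting an edited-then-sent suggestion as accepted is a definition a team would have to agree on.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SuggestionOutcome:
    """One AI suggestion and what the agent did with it (hypothetical schema)."""
    outcome: str                         # "sent_as_is", "edited", or "rejected"
    rejection_tag: Optional[str] = None  # e.g. "wrong_source"; only set on rejections


def summarize(outcomes: list[SuggestionOutcome]) -> dict:
    """Return acceptance rate, edit rate, and a count of rejection tags."""
    total = len(outcomes)
    if total == 0:
        return {"acceptance_rate": 0.0, "edit_rate": 0.0, "rejection_tags": {}}

    counts = Counter(o.outcome for o in outcomes)
    # Assumption: an edited-then-sent suggestion still counts as accepted.
    accepted = counts["sent_as_is"] + counts["edited"]
    # Untagged rejections are kept visible rather than silently dropped.
    tags = Counter(
        o.rejection_tag or "untagged"
        for o in outcomes
        if o.outcome == "rejected"
    )
    return {
        "acceptance_rate": accepted / total,
        "edit_rate": counts["edited"] / total,
        "rejection_tags": dict(tags),
    }


sample = [
    SuggestionOutcome("sent_as_is"),
    SuggestionOutcome("edited"),
    SuggestionOutcome("edited"),
    SuggestionOutcome("rejected", "wrong_source"),
]
print(summarize(sample))
```

For the four sample outcomes, the acceptance rate is 0.75 and the edit rate is 0.5. Keeping the rejection tags alongside the rates is what would let the rejected suggestions feed back into training rather than disappear into an aggregate.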
04
Four principles every screen had to defend.
01
Show the source, not just the answer
An answer without provenance is a guess the agent has to verify anyway. Source citations belong inline with the suggestion, not behind a tooltip. Intercom's Copilot ships this pattern for a reason.
02
Calibrate the language to the certainty
'You'll love this' and 'You might like this' carry different confidence loads with zero additional UI. The model knows what it knows. The copy has to match. UX writing is a confidence signal, and it is the cheapest one to get right.
03
Edit-first, not accept-or-reject
An accept/reject pattern forces a binary on an analog problem. Most suggestions are 80% right and need a small edit. The default action should be edit-and-send, not accept-as-is. The interaction model is the trust model.
04
Failure is a designed surface, not an oversight
Low-confidence states, ambiguous queries, refused responses, and escalations are not fallbacks. They are the screens that determine whether the system gets trusted on the high-confidence ones. Design them with the same care as the success path.
05
What this study covered. What it did not.
Real AI customer support platforms have surfaces that take years to design well. This study scoped to the trust layer specifically: how the copilot communicates what it knows, surfaces what it does not, and handles its own failure. Anything outside that loop was deliberately excluded so the work could go deep rather than wide.
In scope
- Three role-based dashboards (Agent, Supervisor, CX Admin)
- Inline suggestion card with confidence state, source citation, and edit-first interaction
- Low-confidence and refusal states for the agent surface
- Auto-escalation trigger rules for the supervisor surface
- Knowledge source management and feedback loop for the CX Admin surface
- A token architecture for AI confidence states
- Customer-facing transparency disclosure pattern
Out of scope
- Customer-facing chatbot or end-user product
- Voice AI specific patterns (latency, silence detection, barge-in)
- Onboarding flows
- Settings and integrations beyond the trust layer
- Pricing, billing, and admin surfaces unrelated to AI
- Localization and multi-language behavior
- Mobile design beyond responsive principles
06
Three roles. Three views. One trust system.
Most AI copilot products ship one surface and let role-based filtering do the rest. This study split the product into three role-specific dashboards built on a shared trust layer. The Agent uses the AI in real time. The Supervisor watches what the AI is doing across the team. The CX Admin trains it. Each one asks a different question first thing in the morning, and each one needs a different trust signal to answer it.
Support Agent
Primary question
Can I send this AI suggestion as-is, or do I need to edit it?
Primary action
Read suggestion, check the cited source, edit if needed, send
Daily metric they care about
Tickets closed, time per ticket, CSAT on their conversations
Support Supervisor
Primary question
Is the team accepting AI suggestions safely, or are mistakes being shipped?
Primary action
Review flagged conversations, audit AI-assisted responses, calibrate escalation thresholds
Daily metric they care about
AI suggestion acceptance rate, edit rate, escalation rate, QA score
CX Admin (AI Ops)
Primary question
Is the knowledge base feeding the AI accurate and current?
Primary action
Manage knowledge sources, review hallucination flags, tune escalation rules, retrain on rejected suggestions
Daily metric they care about
Knowledge coverage gaps, hallucination flag count, suggestion rejection patterns, source drift
07
Five decisions, five forks, five calls this study would defend.
Decision 01 of 05
Three confidence states, not a percentage
What I considered
Show the model's raw confidence score as a percentage on every suggestion. This is the path PlantNet and several developer tools take.
What I chose
Three states: High Confidence, Moderate Confidence, Needs Verification. Each maps to a defined backend threshold. The percentage exists in the data, but the surface shows the state.
Why
Raw percentages create cognitive load and false precision. An agent under time pressure does not need to interpret 73%. They need to know whether to send or to check. The three-state pattern is the right resolution for the agent's actual decision. Engineering still computes the percentage. The UI translates.
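The translation from score to state can be sketched as a simple banding function. The two threshold values below are placeholders assumed for illustration; the study defines that thresholds exist but does not publish numbers, and `to_state` is a hypothetical name.

```python
from enum import Enum


class ConfidenceState(Enum):
    HIGH = "High Confidence"
    MODERATE = "Moderate Confidence"
    NEEDS_VERIFICATION = "Needs Verification"


# Placeholder thresholds, assumed for illustration only.
# Real values would come from calibration against production data.
HIGH_THRESHOLD = 0.85
LOW_THRESHOLD = 0.55


def to_state(score: float) -> ConfidenceState:
    """Translate a raw model confidence score (0.0 to 1.0) into a UI state."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= HIGH_THRESHOLD:
        return ConfidenceState.HIGH
    if score >= LOW_THRESHOLD:
        return ConfidenceState.MODERATE
    return ConfidenceState.NEEDS_VERIFICATION


# The agent sees only the state label, never the raw number.
for raw in (0.92, 0.73, 0.40):
    print(f"{raw:.2f} -> {to_state(raw).value}")
```

Keeping the thresholds as named constants means they can be tuned without touching the surface: the agent-facing vocabulary stays at three states even if the bands move.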
Decision 02 of 05
Source citation inline, not in a tooltip
What I considered
Show the answer prominently, hide the source behind a hover or click. Cleaner visually.
What I chose
Source appears immediately below the suggestion, named and clickable. Snippet of the relevant text on hover. Always visible by default.
Why
Provenance is the product. Hiding it behind an interaction defeats the purpose. Intercom's Copilot ships this exact pattern because the workflow needs it. The agent does not verify if they have to work to verify. Visible sources turn verification from a chore into a glance.
Decision 03 of 05
Edit-first interaction model
What I considered
An Accept button and a Reject button as the default actions on every suggestion. Some products treat AI as a yes-or-no proposition.
What I chose
The default primary action is Send with Edit. The suggestion drops into the response field as draft text. Accept-as-is and Reject are available, but not primary.
Why
Most AI suggestions are 80% right and need a tweak. Forcing the agent to choose accept or reject is the wrong cognitive frame. Edit-first matches the actual workflow, lowers the cost of using the copilot, and produces better data on rejection (because rejection now actually means rejection, not just 'I wanted to change it').
Decision 04 of 05
Auto-escalation rules, not agent judgment alone
What I considered
Trust the agent to escalate when they sense the AI is wrong. Standard pattern. Lowest implementation cost.
What I chose
Define hard auto-escalation triggers: repeated low-confidence states on the same case, refused responses, sentiment shift in the customer message, tool-call failures, or any keyword in a defined high-risk list. These trigger escalation without requiring agent action.
Why
By 3 PM, agent judgment is tired judgment. The system needs deterministic guardrails for high-risk cases. Yuma AI and NimbleBrain both document auto-escalation as core to keeping AI-assisted support trustworthy at scale. This is the design pattern that protects the team from the AI's own confidence bias.
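A deterministic trigger check along these lines might look like the sketch below. Everything named here is an assumption for illustration: the `CaseSignals` fields, the keyword list, and the limit of two low-confidence states are hypothetical, and upstream systems (sentiment analysis, tool-call monitoring) are assumed to have already produced the boolean signals.

```python
from dataclasses import dataclass

# Illustrative values only; a real list would be owned by the CX Admin.
HIGH_RISK_KEYWORDS = {"chargeback", "lawsuit", "data breach"}
MAX_LOW_CONFIDENCE_STATES = 2


@dataclass
class CaseSignals:
    """Signals collected for one support case (hypothetical schema)."""
    low_confidence_count: int = 0    # low-confidence states seen on this case
    response_refused: bool = False   # the model declined to answer
    sentiment_shift: bool = False    # customer sentiment dropped between messages
    tool_call_failed: bool = False   # a backend lookup or action failed
    customer_message: str = ""


def escalation_reasons(signals: CaseSignals) -> list[str]:
    """Return every trigger that fired; an empty list means no auto-escalation."""
    reasons = []
    if signals.low_confidence_count >= MAX_LOW_CONFIDENCE_STATES:
        reasons.append("repeated_low_confidence")
    if signals.response_refused:
        reasons.append("refused_response")
    if signals.sentiment_shift:
        reasons.append("sentiment_shift")
    if signals.tool_call_failed:
        reasons.append("tool_call_failure")
    text = signals.customer_message.lower()
    if any(keyword in text for keyword in HIGH_RISK_KEYWORDS):
        reasons.append("high_risk_keyword")
    return reasons


case = CaseSignals(low_confidence_count=2,
                   customer_message="I will file a chargeback.")
print(escalation_reasons(case))
```

Returning the list of reasons rather than a bare yes or no is a deliberate choice in this sketch: the supervisor surface can then show why a case was escalated, which is part of making the guardrail itself trustworthy.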
Decision 05 of 05
Customer transparency at moderate confidence
What I considered
Disclose AI involvement on every response. Or never. Both are common.
What I chose
Disclose AI involvement only when confidence is moderate or the response was assembled from multiple sources. High-confidence answers grounded in a single canonical source do not require disclosure. Refused or escalated responses make the human-only path explicit.
Why
Customers say they prefer no AI in service. But what they actually want is correct answers from someone who cares. Blanket disclosure on every response trains customers to distrust the channel. No disclosure at all is dishonest. Disclosing exactly when the system is uncertain treats the customer as an adult. This is the pattern that keeps the team's CSAT while preserving honesty.
08
A trust gradient, because confidence is not one signal, it is several.
The hardest design problem in this study was not deciding whether to show confidence. It was deciding how. A single confidence number puts the work on the agent. A binary trust-or-don't signal hides too much. The right resolution is a gradient: a defined relationship between the model's internal confidence, the visible UI state, the language used in the response, and the action available to the agent.
This is the artifact that lets engineering, design, and CX operations agree on what the product should feel like at each end of the spectrum. It is the artifact that should be reviewed by a Legal team before deployment. It is the artifact that turns "trust design" from a phrase into a specification.
Trust gradient
High Confidence
Backend signal
Model confidence above defined threshold, single canonical source, no flagged ambiguity
UI treatment
Green confidence badge. Source visible. Suggestion shown as ready-to-send draft.
Language style
Direct, factual. 'Your order shipped on Tuesday and is expected Thursday.'
Available actions
Send. Edit and send. Skip.
Customer disclosure
Not required.
Moderate Confidence
Backend signal
Model confidence in middle band, or multiple sources synthesized, or ambiguous customer intent
UI treatment
Amber confidence badge. Two sources visible. Suggestion shown with a 'Verify before sending' note.
Language style
Hedged. 'Based on your order history, it looks like your shipment is scheduled for Thursday. Please confirm.'
Available actions
Edit and send. Escalate. Skip.
Customer disclosure
Yes. 'Generated with AI assistance, reviewed by [Agent name].'
Needs Verification
Backend signal
Model confidence below threshold, refused response, no source found, or auto-escalation triggered
UI treatment
Red confidence indicator. No suggestion shown. Reason for low confidence shown in plain language.
Language style
No AI-drafted response. Agent writes from scratch.
Available actions
Write manually. Escalate. Mark for retraining.
Customer disclosure
Human-only response, no AI involvement.
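One way to turn the gradient into a specification that engineering, design, and CX operations can review together is to encode it as data. The sketch below is an assumption-laden illustration, not the study's implementation: the `Treatment` fields, the `classify` function, its default thresholds, and the use of a source count of exactly one as a stand-in for "single canonical source" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Treatment:
    """What the agent surface does for one confidence state."""
    badge: str
    show_suggestion: bool
    actions: tuple[str, ...]
    customer_disclosure: str


# One row per state, mirroring the trust gradient above.
GRADIENT = {
    "high": Treatment(
        "green", True, ("send", "edit_and_send", "skip"), "not_required"),
    "moderate": Treatment(
        "amber", True, ("edit_and_send", "escalate", "skip"),
        "ai_assisted_reviewed_by_agent"),
    "needs_verification": Treatment(
        "red", False, ("write_manually", "escalate", "mark_for_retraining"),
        "human_only"),
}


def classify(score: float, source_count: int, ambiguous: bool = False,
             refused: bool = False, escalated: bool = False,
             high: float = 0.85, low: float = 0.55) -> str:
    """Combine backend signals into a gradient state.

    The thresholds are placeholders; the study does not publish values.
    """
    if refused or escalated or source_count == 0 or score < low:
        return "needs_verification"
    if score < high or source_count > 1 or ambiguous:
        return "moderate"
    return "high"


state = classify(score=0.95, source_count=2)
print(state, GRADIENT[state].actions)
```

Note that a high score alone is not enough in this sketch: a 0.95 answer synthesized from two sources still lands in the moderate state, so the plain Send action is withheld and the customer disclosure applies.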
09
A token architecture for AI confidence states.
A product where every screen depends on a trust signal needs token architecture that makes those signals consistent. The design system for this study uses a three-layer token model: primitive, semantic, component. Primitives never appear in components. Semantic tokens carry the confidence states. Component tokens scope to the specific UI patterns that depend on them: the suggestion card, the source citation block, the confidence badge, the escalation trigger banner.
Layer 01
Primitive
Raw values, never used directly in components.
- color-green-500: #16A34A
- color-amber-500: #F59E0B
- color-red-500: #DC2626
- space-3: 12px
- font-size-sm: 13px
Layer 02
Semantic
Intent-based aliases for AI states. Components reference these.
- color-confidence-high
- color-confidence-moderate
- color-confidence-low
- color-source-citation
- color-escalation-banner
Layer 03
Component
Scoped to specific AI patterns.
- color-suggestion-card-border-high
- color-suggestion-card-border-moderate
- color-confidence-badge-text
- space-source-citation-inline
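The three-layer rule can be sketched as alias resolution plus a check that no layer is skipped. The token names and hex values come from the lists above; the specific alias mapping (which semantic token points at which primitive) is an assumption inferred from the green, amber, and red treatments in the trust gradient, and `resolve` and `check_layering` are hypothetical helper names.

```python
# Layer 01: raw values.
PRIMITIVE = {
    "color-green-500": "#16A34A",
    "color-amber-500": "#F59E0B",
    "color-red-500": "#DC2626",
}

# Layer 02: semantic tokens alias primitives.
# The mapping is assumed from the badge colors in the trust gradient.
SEMANTIC = {
    "color-confidence-high": "color-green-500",
    "color-confidence-moderate": "color-amber-500",
    "color-confidence-low": "color-red-500",
}

# Layer 03: component tokens alias semantic tokens, never primitives.
COMPONENT = {
    "color-suggestion-card-border-high": "color-confidence-high",
    "color-suggestion-card-border-moderate": "color-confidence-moderate",
}


def resolve(component_token: str) -> str:
    """Follow component -> semantic -> primitive and return the raw value."""
    semantic = COMPONENT[component_token]
    primitive = SEMANTIC[semantic]
    return PRIMITIVE[primitive]


def check_layering() -> None:
    """Raise if any token skips a layer."""
    for token, target in COMPONENT.items():
        if target not in SEMANTIC:
            raise ValueError(f"{token} must reference a semantic token, got {target}")
    for token, target in SEMANTIC.items():
        if target not in PRIMITIVE:
            raise ValueError(f"{token} must reference a primitive token, got {target}")


check_layering()
print(resolve("color-suggestion-card-border-high"))
```

Under this structure, changing what "high confidence" looks like is a one-line edit in the semantic layer, and every component that depends on it follows.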
11
What comes next.
- Moderated research with support agents who currently use a copilot product such as Intercom, Zendesk, or Assembled, testing the three-confidence-state pattern, the edit-first interaction, and the inline source citation. Comparing measured acceptance rates against self-reported trust would show whether the gradient maps to how agents actually decide.
- A working prototype of the suggestion card with the trust gradient, wired to a real LLM with controlled confidence scoring, then run for two weeks in parallel with an existing copilot product, with suggestion acceptance rate, edit rate, escalation rate, and CSAT measured side by side. That data would sharpen which confidence thresholds actually hold up in production.
- Deeper work on the supervisor dashboard. The supervisor role got a thinner treatment than the agent role. The QA workflow for AI-assisted responses, especially catching hallucinations before they ship, deserves its own exploration in a follow-up.
12
The shipped screens.
What this study produced visually.
If you are hiring for a senior product designer role in AI-integrated products and want to discuss this work or anything in my portfolio, reach me at hey@shahriarsultan.com.