In the fall of 2024 we shipped a product update that passed WCAG 2.2 AA with a clean audit. Three weeks later, a user in Nairobi told us she had given up on our app. Her phone was three years old. Her screen reader was an older fork. Our "compliant" update had broken her workflow in four places the audit could not see.

This is a story we have now heard 89 times in 18 months of field research. It is not a story about bad engineers or careless designers. It is a story about a standard that was written for a different set of conditions than the ones most of our users actually live in.

This essay is our attempt to describe the gap — and what we have started doing about it.

The audit paradox

The Web Content Accessibility Guidelines are, by every reasonable measure, a remarkable achievement. They have given an entire industry a shared vocabulary, a procurement checklist, and a legal floor. Most of the wins of the last decade — captions, alt text, keyboard nav, color-contrast baselines — exist because WCAG made them measurable.

But measurability has a cost. The parts of accessibility that translate cleanly into unit tests are a fraction of the parts that matter. And the economic logic of audits pushes teams to optimise for the measurable part and stop there.

Every team we interviewed could cite their WCAG score. Only two in thirty-one could describe what their lowest-end user actually experienced on a given Tuesday. — TEWW Field Team, Internal Report 04.2

This is the audit paradox: the score goes up; the experience doesn't. We've seen products with AAA scores that real screen-reader users abandon after the first session, and products with imperfect scores that users adopt and never leave.

What the spec assumes

The 2.2 spec, read charitably, assumes a roughly modern browser, a roughly modern assistive tech stack, a reliable network, and a user who can productively solve small frictions on their own. None of these are guaranteed in the conditions we build for.

  • 61% of our users run a screen reader more than two major versions behind the current release.
  • 3.4 s median time-to-first-announce on a mid-tier Android phone over 3G, versus 0.4 s on a benchmark desktop.
  • 1 in 4 sessions involves at least one pause longer than 30 s caused by connectivity, not the interface.

These are not exotic users. They are the median of the global population we work with, and they are mostly invisible to tooling built in Mountain View or Berlin.

Five failure modes

When we reviewed the 312 session transcripts from our field work, five failure modes accounted for 74% of the usability breakdowns — and none of them show up reliably on an audit.

1. Stale-ARIA drift

Dynamic attributes (aria-expanded, aria-live) get updated by JavaScript that runs on the assumption the announcement will land before the next user input. On slower stacks, it often doesn't. Users act on a state the UI has already moved past.
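One guard we have found workable is sketched below in browser TypeScript: treat the widget as in-flight until the new state has had a chance to be announced, instead of assuming the announcement always wins the race. The element ids and the 300 ms settle window are our assumptions, not spec values.

```typescript
// A sketch of one guard against stale-ARIA drift, assuming a disclosure
// widget with hypothetical ids (#disclosure, #panel).
const toggle = document.querySelector<HTMLButtonElement>("#disclosure")!;
const panel = document.querySelector<HTMLElement>("#panel")!;

let settling = false;

toggle.addEventListener("click", () => {
  // Drop input that arrives while the previous state change is still
  // settling; on a slow stack the user may be acting on the old state.
  if (settling) return;
  settling = true;

  const expanded = toggle.getAttribute("aria-expanded") === "true";
  toggle.setAttribute("aria-expanded", String(!expanded));
  panel.hidden = expanded;

  // aria-busy tells assistive tech the region is mid-update. The 300 ms
  // settle window is our assumption; tune it on real devices, not a
  // benchmark desktop.
  panel.setAttribute("aria-busy", "true");
  setTimeout(() => {
    panel.removeAttribute("aria-busy");
    settling = false;
  }, 300);
});
```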

2. Focus mid-flight

Modals, drawers, and toasts move focus deliberately. When a slow network or a still-running animation delays that move, the user's next keystroke lands in the old context. The audit sees the correct focus-management code; the user sees keys going nowhere.
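One mitigation is to defer the focus move until the entrance transition has actually finished, with a timer as a safety net. A minimal sketch in browser TypeScript; the [autofocus] convention and the 500 ms fallback are assumptions to tune, while transitionend and the timer are standard DOM APIs.

```typescript
// Move focus only once a dialog's entrance transition has settled, so
// keystrokes don't land in the old context mid-flight.
function focusWhenSettled(dialog: HTMLElement, fallbackMs = 500): void {
  // Prefer an explicit target; fall back to the dialog itself, which then
  // needs tabindex="-1" to be programmatically focusable.
  const target = dialog.querySelector<HTMLElement>("[autofocus]") ?? dialog;

  let done = false;
  const moveFocus = () => {
    if (done) return;
    done = true;
    target.focus();
  };

  // The real signal is the transition ending; the timer covers the cases
  // where it never fires (reduced motion, display: none, dropped frames).
  dialog.addEventListener("transitionend", moveFocus, { once: true });
  setTimeout(moveFocus, fallbackMs);
}
```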

3. Invisible semantic drift

Frameworks ship updates that re-order DOM in ways that pass contrast and role checks but change the reading order the screen reader follows. The user re-learns the product weekly without anyone on the team noticing.
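This is one of the few failure modes that can be caught cheaply in CI, by snapshotting the order a screen reader will traverse. A minimal sketch follows; the selector list is our assumption and should be extended to match your product.

```typescript
// Serialize the order in which a screen reader will encounter headings,
// landmarks, and controls, so a framework update that silently re-orders
// the DOM fails a snapshot diff in CI.
function readingOrder(root: ParentNode): string[] {
  const selector =
    "h1, h2, h3, [role], a[href], button, input, select, textarea";
  return Array.from(root.querySelectorAll<HTMLElement>(selector)).map((el) => {
    const role = el.getAttribute("role") ?? el.tagName.toLowerCase();
    const name = el.getAttribute("aria-label") ?? el.textContent?.trim() ?? "";
    return `${role}: ${name.slice(0, 40)}`;
  });
}

// In Jest or Vitest this becomes a one-line regression test:
//   expect(readingOrder(document.body)).toMatchSnapshot();
```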

Fig. 01: Abandonment rate vs. WCAG audit score across the 41 products in our benchmark set. The correlation is weaker than you might expect, and the correlation with our field-usability score is stronger by a factor of 3.2.

4. Implicit literacy

Guidelines do not require that the language of a button make sense in the user's vocabulary. "Dismiss," "Authenticate," "Verify" are all technically accessible and practically opaque to users whose second or third language is the interface language.
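A crude but useful response is to lint control labels against the vocabulary your field research says users actually have. A minimal sketch; the wordlist below is a hypothetical placeholder, not a recommendation.

```typescript
// Flag button labels that fall outside a plain-language wordlist. The
// wordlist here is a stand-in; in practice it should come from your own
// field vocabulary research, not from this example.
const PLAIN_WORDS = new Set(["close", "sign in", "check", "send", "save"]);

function opaqueLabels(root: ParentNode): string[] {
  return Array.from(
    root.querySelectorAll<HTMLElement>("button, [role='button']")
  )
    .map((el) =>
      (el.getAttribute("aria-label") ?? el.textContent ?? "")
        .trim()
        .toLowerCase()
    )
    .filter((label) => label.length > 0 && !PLAIN_WORDS.has(label));
}

// "Dismiss", "Authenticate", and "Verify" all get flagged here; "Close",
// "Sign in", and "Check" pass.
```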

5. The cold-start cliff

Products are tested on warm caches and configured assistive tech. Real sessions often begin cold: new device, fresh install, first-run assistive-tech pairing. The first ninety seconds are the most accessibility-hostile window in the entire product — and they are almost never tested.
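Cold starts are also one of the easier conditions to automate. A minimal sketch using Playwright, whose fresh browser contexts start with no cache, cookies, or storage; the URL and the 10-second budget are assumptions to replace with your own.

```typescript
// Exercise the product from a genuinely cold profile, which approximates a
// fresh install far better than a warmed-up development browser.
import { test, expect } from "@playwright/test";

test("first run is usable inside the cold-start window", async ({ browser }) => {
  const context = await browser.newContext(); // clean profile every run
  const page = await context.newPage();

  const start = Date.now();
  await page.goto("https://example.org/"); // hypothetical product URL
  await page.getByRole("heading").first().waitFor(); // something meaningful rendered

  expect(Date.now() - start).toBeLessThan(10_000);
  await context.close();
});
```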

If you take one thing from this piece

Audit once. Then watch a user you have never met open your product on a phone you would not use. The delta between those two experiences is your real accessibility debt.

A field-first methodology

We have spent the last year codifying a research methodology we call field-first. It does not replace WCAG; it starts where WCAG stops. The full protocol is published in our open-access research library, but the shape of it is simple:

  • Observe before measuring. Every product cycle starts with a week of passive session recordings on real devices in real homes, not labs.
  • Recruit the long tail. Our panel deliberately over-indexes on older hardware, older AT versions, and multilingual users. The panel's median device age is 3.2 years.
  • Score the friction, not the compliance. We track time-to-task, dead-key events, and recovery rate; WCAG checks come after these. (A sketch of the event taxonomy follows this list.)
  • Ship the artefact, not the memo. Every field study ends with a working fix checked into a public repo, plus a fifteen-minute video a non-specialist engineer can watch before standup.
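For concreteness, here is a minimal sketch of the event taxonomy behind those three friction metrics. The field names are our assumed shape for illustration, not a published TEWW schema.

```typescript
// Friction events behind time-to-task, dead-key events, and recovery rate.
type FrictionEvent =
  | { kind: "time_to_task"; taskId: string; ms: number }          // start to first success
  | { kind: "dead_key"; key: string; focusedRole: string | null } // keystroke with no observable effect
  | { kind: "recovery"; taskId: string; recovered: boolean };     // did the user get back on track?

function recoveryRate(events: FrictionEvent[]): number {
  const attempts = events.filter(
    (e): e is Extract<FrictionEvent, { kind: "recovery" }> =>
      e.kind === "recovery"
  );
  if (attempts.length === 0) return 1; // nothing went wrong to recover from
  return attempts.filter((e) => e.recovered).length / attempts.length;
}
```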

This is slower than an audit. It is also immeasurably more useful, and after twelve months we have enough data to say that with some confidence.

Compliance is the floor. Dignity is the ceiling. The gap between them is where most products live — and where our users still can't. — Principle 04, TEWW Research Charter

What changes on Monday

This piece is long on critique; the team I lead tries to be short on it in daily practice. If you are a designer, engineer, or PM reading this, here is what we would ask you to do this week, in order of cost:

  1. Open your product on a phone older than your current one. Take ten screenshots of things that are harder than you remembered.
  2. Pair with a colleague who uses your product differently than you do. Narrate. Don't fix. Listen.
  3. Pick one of the five failure modes above and write a test for it. One. Check it in. Do it again next week.
  4. Publish the test. The field is small and the stakes are large; our progress compounds when we share.

None of this replaces a formal audit. All of it will catch things a formal audit never will.


If you want to talk about any of this — disagreement especially welcome — I am reachable at rhea@thirdeyeworldwide.org. Everything in this essay reflects the work of a team far larger than me; the errors are mine.

References & notes

  1. TEWW Field Research Cohort, Q1 2024 – Q1 2026. 312 participants across Kenya, India, Brazil, and Egypt. Method appendix in research library.
  2. Benchmark set of 41 consumer applications, sampled for WCAG 2.2 audit presence. Audit scores self-reported; usability scores measured in field.
  3. The five failure modes emerged from open coding of session transcripts by three coders; inter-rater agreement κ = 0.81.
  4. Screen-reader version data drawn from telemetry with user consent; ≥ 2 major versions behind defined against WAI-ARIA Authoring Practices 1.2 at time of study.