
Automatic Speech Recognition for UK Meetings with Regional Accents: A Benchmark Review




Executive Summary

Automatic speech recognition (ASR) tools such as Whisper, Otter.ai, Zoom Transcription, Microsoft Teams Live Transcription, and TurboScribe are increasingly used for meeting transcription. However, in UK contexts—where meetings often involve crosstalk, overlapping dialogue, and diverse regional accents—accuracy can vary widely. Published benchmarks from the AMI Meeting Corpus and other studies show realistic word error rates (WER) of 11–14% in close-microphone settings and 19–25% (or higher) in far-field conditions. Regional accents such as Scottish, Welsh, and Northern English further degrade accuracy unless systems are adapted. Vendor claims of near-perfect performance (e.g., TurboScribe’s 99.8% accuracy) should be treated cautiously, as independent benchmarks place most systems in the mid-teens to low-twenties WER under real-world meeting conditions. For critical business needs, professional transcription services such as The Typing Works remain essential for ensuring accurate, accessible, and compliant records.


Introduction

Automatic speech recognition (ASR) has advanced rapidly in the last decade, driven by deep learning and transformer-based architectures. While general-purpose models such as OpenAI’s Whisper and commercial transcription platforms like Otter.ai, Zoom Transcription, and Microsoft Teams Live Transcription have become widely adopted, their performance varies substantially depending on use case, language variety, and recording conditions. For organisations in the UK, where meetings frequently involve regional accents and multi-speaker overlap, it is critical to understand the transcription accuracy they can realistically expect. This article reviews published benchmarks of ASR performance in meeting settings, with particular emphasis on datasets that reflect UK accents and conversational dynamics. For readers requiring higher accuracy, a link to professional transcription services (The Typing Works) is also provided.

Word Error Rate and Its Significance

The primary metric used to evaluate ASR systems is the word error rate (WER). This is defined as the sum of substitutions, insertions, and deletions divided by the total number of words in the reference transcript. Lower WER indicates higher accuracy. In academic and commercial benchmarks, WER can range from under 5% for clean, single-speaker audio to over 30% for spontaneous, noisy, or multi-speaker conversations. For professional contexts such as business meetings, even modest error rates (10–15%) can significantly impact comprehension and usability.
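Expressed as a formula, WER = (S + D + I) / N, where S, D, and I are the counts of substituted, deleted, and inserted words and N is the number of words in the reference. The short Python sketch below computes WER with a standard word-level edit-distance alignment; the example sentences are purely illustrative, and the function is a simplified version of what full evaluation toolkits provide.

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """Return (substitutions + deletions + insertions) / number of reference words."""
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        # Word-level Levenshtein distance via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                                   # deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                                   # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,            # deletion
                              d[i][j - 1] + 1,            # insertion
                              d[i - 1][j - 1] + cost)     # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # One substituted word in a ten-word reference gives a WER of 10%.
    reference = "the minutes of the last meeting were approved without amendment"
    hypothesis = "the minutes of the last meeting were approved without amendments"
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 10.0%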

Benchmarking ASR on Meeting Speech

The AMI Meeting Corpus has become a de facto standard dataset for evaluating ASR in meeting contexts. Recorded at the University of Edinburgh and associated institutions, the AMI dataset includes over 100 hours of meetings in English with a wide range of British and European accents, multiple speakers, spontaneous dialogue, and both headset and distant microphones.

Recent benchmarks report the following results:

  • On IHM (Individual Headset Microphone) recordings, state-of-the-art systems achieve ~11–14% WER. This setting minimises crosstalk, as each speaker wears a close microphone.1
  • On SDM (Single Distant Microphone) recordings, where multiple speakers are captured by a single room mic, WER is substantially higher: typically ~19–25% for strong systems.2
  • Some reports note that in uncontrolled SDM scenarios, WER can still exceed 35%, highlighting the challenge of overlapping speech and room acoustics.3

These figures set realistic expectations for UK meeting transcription. A clean, well-microphoned meeting might approach 12% WER, whereas a typical boardroom with crosstalk and varied accents may be closer to 20–25% WER.

Impact of UK Regional Accents

ASR systems are trained predominantly on large, standardised corpora, often dominated by American or standard Southern British English. As a result, regional UK accents pose additional challenges:

  • A 2025 study on Scottish regional accents demonstrated that baseline Whisper models produced systematically higher error rates without adaptation. Accent-specific fine-tuning reduced these errors, showing that adaptation is key to handling dialectal variation.4
  • The Edinburgh International Accents of English Corpus (2023) and similar resources confirm measurable accuracy gaps between different accent groups, with Northern English, Welsh, and Scottish accents particularly prone to misrecognition.5
  • A dataset of British Isles accents (31 hours across Southern, Midlands, Northern English, Welsh, Scottish, and Irish speakers) has been used to quantify these disparities. Results consistently show that regional accent diversity increases WER relative to standardised English benchmarks.6

For UK organisations, this means that even post-processed ASR pipelines may struggle to deliver below ~15% WER in meetings with mixed regional participation unless adaptation techniques are applied.
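Where per-utterance references are available (for example, from a human-corrected sample of meeting audio), the word_error_rate function sketched earlier can be used to quantify accent-related gaps by averaging WER within each accent group. The accent labels and sentences below are purely illustrative:

    from collections import defaultdict

    # (accent label, human reference, ASR hypothesis) triples -- illustrative data only.
    utterances = [
        ("Scottish", "we will circulate the minutes on friday",
                     "we will circulate the minutes on friday"),
        ("Scottish", "the budget figures need another look",
                     "the budget figures need an other look"),
        ("Welsh",    "can everyone see the shared screen",
                     "can everyone see the shared screen"),
    ]

    totals = defaultdict(lambda: [0.0, 0])
    for accent, reference, hypothesis in utterances:
        totals[accent][0] += word_error_rate(reference, hypothesis)
        totals[accent][1] += 1

    # Mean per-utterance WER by accent group highlights where adaptation is most needed.
    for accent, (wer_sum, count) in totals.items():
        print(f"{accent}: mean WER {wer_sum / count:.1%}")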

Comparison of Major ASR Systems

Several widely used ASR platforms have been benchmarked on clinical or business data, though not always under UK-specific conditions:

  • Whisper (OpenAI): Reported ~14.8% WER on psychiatric interview audio, with a wider range (~11–20%) depending on conditions.7 Vendor benchmarks for Whisper-v2 suggest ~8% WER under cleaner conditions, though independent verification is limited.8 A minimal usage sketch for Whisper appears after this list.
  • Otter.ai: Independent reports indicate WER of 12–13% under good conditions,9 but peer-reviewed tests in clinical meetings via Zoom integration show ~19% WER.7
  • Zoom Transcription: Commissioned tests reported 7.4% WER in controlled conditions,10 but in clinical meetings, Zoom/Otter transcription showed ~19% WER.7
  • Microsoft Teams Transcription: Benchmarks report ~11.5% WER in vendor-controlled tests, higher than Zoom’s reported figure.10
  • TurboScribe: Markets “99.8% accuracy”11 but provides no independent benchmarks. Given it uses Whisper as a backend, its real-world accuracy should be assumed similar to Whisper’s (~10–20% WER in UK meetings).
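As a concrete usage example, the sketch below runs batch transcription with the open-source openai-whisper package (which, as noted above, also underlies TurboScribe). The model size, file name, and language setting are illustrative assumptions rather than recommendations:

    # pip install openai-whisper
    import whisper

    # Load a larger model for offline, batch transcription; smaller models
    # ("base", "small", "medium") trade accuracy for speed and memory.
    model = whisper.load_model("large-v2")

    # "meeting.wav" is a placeholder for a recorded UK meeting.
    result = model.transcribe("meeting.wav", language="en")

    # The result contains the full text plus timestamped segments.
    print(result["text"])
    for segment in result["segments"]:
        print(f"{segment['start']:.1f}s-{segment['end']:.1f}s: {segment['text']}")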

Meeting Audio Challenges

Meeting transcription accuracy is degraded by several well-documented factors:

  1. Crosstalk: Overlapping speech increases substitution and deletion errors. AMI SDM results illustrate this starkly, with WER nearly doubling compared to headset mic scenarios.
  2. Accents: Non-standard and regional pronunciations reduce ASR model robustness. UK regional accent studies confirm systematic degradation in WER.
  3. Acoustic conditions: Distant microphones, reverb, and room noise inflate error rates.
  4. Domain-specific vocabulary: Technical terms and organisation-specific jargon are often mis-transcribed without custom lexicons.

Improving Accuracy: Strategies

While out-of-the-box ASR systems may deliver 15–25% WER in UK meetings, several strategies can reduce errors:

  • Microphone quality: Headset or lapel microphones significantly reduce crosstalk and background noise.
  • Speaker diarisation: Identifying who is speaking can improve alignment and reduce insertion errors (see the sketch after this list).
  • Post-processing with large models: Batch transcription with models like Whisper-large or adapted versions yields better performance than real-time systems.
  • Accent adaptation: Fine-tuning on corpora of regional accents improves robustness.
  • Human correction: For critical use cases, pairing ASR with professional human transcription ensures near-perfect accuracy.
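As a sketch of the diarisation point above, the following uses a pre-trained pyannote.audio pipeline to label speaker turns, which can then be aligned with ASR segment timestamps. The model name, access token, and file name are assumptions, and the pipeline requires accepting the model’s terms on Hugging Face:

    # pip install pyannote.audio
    from pyannote.audio import Pipeline

    # Load a pre-trained speaker-diarisation pipeline (model name and token are placeholders).
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="YOUR_HF_TOKEN",
    )

    # Segment the meeting recording into speaker turns.
    diarization = pipeline("meeting.wav")

    # Print who spoke when; aligning these turns with ASR timestamps
    # attributes transcript text to individual speakers.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")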

Conclusion

For UK meetings involving multiple speakers, crosstalk, and regional accents, realistic expectations for automatic transcription systems are ~11–14% WER in optimal close-mic conditions and ~19–25% WER in more typical far-field settings. Regional accent diversity further increases error rates. Among popular tools, Whisper, Otter.ai, Zoom, and Teams show broadly similar performance in the mid-teens to low-twenties WER under meeting-like conditions, with vendor-reported lower figures generally reflecting more controlled test scenarios. TurboScribe’s claimed near-perfect accuracy should be treated with caution, given the absence of independent verification.

For organisations requiring reliable transcripts of UK meetings, especially where crosstalk and accent variation are present, professional human transcription remains essential. Services such as The Typing Works provide high-accuracy transcripts that complement automated solutions, ensuring clarity, compliance, and accessibility.


Footnotes

  1. MERL. ASR Pre-Training on AMI Individual Headset Microphone Corpus. 2025.
  2. MERL. Joint Separation and ASR Fine-Tuning on AMI SDM. 2025.
  3. Survey on Meeting Transcription Benchmarks. AMI Corpus SDM Error Rates. 2024.
  4. Accent Adaptation Study. Scottish Regional Accents in ASR. 2025.
  5. Edinburgh International Accents of English Corpus. 2023.
  6. British Isles Accent Dataset. HuggingFace, 2024.
  7. PubMed Central. Compliant Transcription Services for Virtual Psychiatric Interviews. 2023.
  8. Gladia Benchmark. Whisper-v2 Word Error Rate. 2024.
  9. SuperAGI Blog. Transcription Showdown: Comparing Meeting Tools. 2024.
  10. TestDevLab Report for Zoom. WER Comparison: Zoom vs Teams vs Webex. 2024.
  11. TurboScribe Marketing Website. Accessed 2025.