
AI speech-to-text accuracy: Is it good enough for your live events?

October 22, 2020 Michael Monette


Does today’s automatic transcription technology offer a viable alternative to traditional live transcription services? Short answer: Yes. With advancements in speech recognition technology, AI speech-to-text accuracy has reached a level that’s suitable for live events – from conference presentations and corporate meetings to university lectures and church sermons.

By no means is this a groundless conclusion. It’s based on our own research looking into the performance of leading speech recognition APIs to determine their “real-time readiness.” Read on for a breakdown of those results.


    Get the best AI speech-to-text accuracy

    Capture the crystal-clear audio that’s essential for accurate AI transcription with Epiphan LiveScrypt, a dedicated automatic transcription device with inputs for professional audio (XLR/TRS) and many more powerful features.


    Methods: Assessing AI speech-to-text accuracy

    We compared three leading speech recognition application programming interfaces (APIs) – Amazon Transcribe, Google Cloud Speech-to-Text, and IBM Watson Speech to Text – to human transcriptionists on a number of criteria:

    • Accuracy: The rate at which the solution makes mistakes in transcribing uttered words, measured as the word error rate: WER = (substitutions + deletions + insertions) / (number of words in the reference).
    • First-hypothesis latency: The time between the utterance of a word and the output of text.
    • Stable-hypothesis latency: The time between the utterance of a word and the output of correct text.
    • Cost: The fee for use of the associated service.
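To make the accuracy metric concrete, here is a minimal sketch of a word error rate calculation. It uses a standard word-level edit distance. The function name is our own for illustration, and this is not necessarily the exact script used in the tests.

```python
def word_error_rate(reference: str, transcript: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = transcript.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the minimum number of word edits needed to turn the
    # reference words processed so far into the first j transcript words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

A transcript that drops one word from a six-word reference scores 1/6, or roughly 0.17. Because insertions count as errors, WER can exceed 1.0 for very poor transcripts.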

    To evaluate automatic transcription performance, we fed each API over 1,500 sample phrases from a test set made available by Texas Instruments and the Massachusetts Institute of Technology (TIMIT). We compared the results to the reference transcriptions included with the test set and measured latency. Ultimately, we decided against adjusting transcription timings for round-trip time (RTT) since RTT made up a relatively small portion of overall latency in every case.
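The two latency measures can be illustrated with a short sketch. It assumes each engine returns a time-ordered series of interim hypotheses, represented here as hypothetical (arrival time, text) pairs. It also treats a word as "stable" from the moment it last settles on the engine's final output for that position. Both are simplifying assumptions for illustration, not a description of our test harness.

```python
from typing import List, Optional, Tuple

# Hypothetical event shape: (arrival time in seconds, full interim text).
Hypothesis = Tuple[float, str]


def word_latencies(
    spoken_at: float, index: int, hypotheses: List[Hypothesis]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (first-hypothesis, stable-hypothesis) latency for one word.

    spoken_at is when the word was uttered; index is its position in the
    utterance. hypotheses must be sorted by arrival time.
    """
    if not hypotheses:
        return None, None
    final_words = hypotheses[-1][1].split()
    if index >= len(final_words):
        return None, None  # the engine never produced a word at this position
    target = final_words[index]

    first_seen: Optional[float] = None
    stable_since: Optional[float] = None
    for arrived, text in hypotheses:
        words = text.split()
        if index < len(words):
            if first_seen is None:
                first_seen = arrived  # first text of any kind for this word
            if words[index] == target:
                if stable_since is None:
                    stable_since = arrived
            else:
                stable_since = None  # the engine revised the word; start over
        else:
            stable_since = None
    return first_seen - spoken_at, stable_since - spoken_at
```

For example, if the second word of an utterance is spoken at 1.0 s, first appears as "word" at 1.6 s, and is corrected to "world" at 1.9 s, the first-hypothesis latency is about 0.6 s and the stable-hypothesis latency about 0.9 s.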

    To establish a baseline for human transcription performance, we drew and generalized results from multiple academic sources.

    A note about terminology

    By “transcriptionist” we mean a professional who transcribes speech using a computer keyboard, as opposed to a stenographer, who can type at higher speeds using a stenotype machine. The corporate, education, and special events markets tend to use transcriptionists because stenographers charge considerably higher rates.

    Regarding the TIMIT test set, the recording of those samples took place in a noise-controlled environment. We normalized the reference transcriptions by converting capital letters to lowercase, removing punctuation, and spelling out numerical terms. Then we calculated the word error rate (WER) for every utterance. Based on the complete test set, for each engine we also calculated a mean WER and WER confidence interval (two-sided, 95% confidence, t-distribution, if you want to get specific).
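The normalization and summary steps described above might look something like the sketch below. Several details are assumptions for illustration: the helper names are our own, numbers are only spelled out from 0 to 99, and apostrophes are kept so contractions stay single words. The confidence interval uses a normal critical value as a stand-in for the t-distribution value, which is nearly identical at around 1,500 samples.

```python
import re
import statistics
from typing import List, Tuple

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_out(number: int) -> str:
    """Spell out an integer from 0 to 99 (larger numbers are out of scope)."""
    if number < 20:
        return ONES[number]
    tens, ones = divmod(number, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text: str) -> str:
    """Lowercase, spell out small numbers, and strip punctuation."""
    text = text.lower()
    # Spell out one- and two-digit numbers before stripping punctuation.
    text = re.sub(r"\b\d{1,2}\b", lambda m: spell_out(int(m.group())), text)
    # Assumption: keep straight apostrophes so "didn't" stays one word.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return " ".join(text.split())


def mean_with_ci(values: List[float], confidence: float = 0.95) -> Tuple[float, float, float]:
    """Return (mean, lower bound, upper bound) of a two-sided interval."""
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / len(values) ** 0.5
    # Normal critical value (about 1.96 for 95%) approximates the t value.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_err, mean + z * std_err
```

Applying `normalize` to both the reference and the engine output before scoring keeps differences in capitalization or punctuation from being counted as word errors.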

    Our data set did include some variance since the test phrases were recorded by a variety of speakers talking at different rates. But this reflects the range of speaking rates, pitches, and other speech differences you’d find in real-world settings. None of the speakers were instructed to talk slowly into a microphone to make an accurate transcription more likely. Given all these precautions, we’re confident the aggregated data is a close reflection of each API’s true accuracy.

    It’s also worth noting that our testing was in English only. English is the most widely used language in the applications we analyzed, which may mean English gets the lion’s share of developer focus. In any case, we suspect there would be only minor variances between languages.

    Results: AI and human transcription compared

    Service             | Accuracy (mean WER) | First-hypothesis latency (seconds) | Stable-hypothesis latency (seconds) | Cost per hour (USD)
    Human (generalized) | 0.04–0.09           | 4.2                                |                                     | 60–200
    Amazon              | 0.088               | 2.956                              | 3.034                               | 1.44
    Google              | 0.085               | 0.576                              | 0.738                               | 1.44
    Google (Enhanced)   | 0.06                | 0.605                              | 0.761                               | 2.16

    *Recorded January 2020
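To put the cost column in event terms, here is a small sketch that multiplies the hourly rates from the results table by an event's duration. The rates are the January 2020 figures reported here. The function name and dictionary are our own, and the calculation ignores any billing increments or free tiers a provider may apply.

```python
from typing import Tuple

# Hourly rates in USD as (low, high), copied from the results table.
HOURLY_RATE_USD = {
    "Human (generalized)": (60.0, 200.0),
    "Amazon": (1.44, 1.44),
    "Google": (1.44, 1.44),
    "Google (Enhanced)": (2.16, 2.16),
}


def event_cost(service: str, hours: float) -> Tuple[float, float]:
    """Return the (low, high) transcription cost in USD for an event."""
    low, high = HOURLY_RATE_USD[service]
    return round(low * hours, 2), round(high * hours, 2)
```

For a three-hour event, this works out to 6.48 USD with Google's enhanced model versus 180 to 600 USD for a human transcriptionist.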

    It’s important to note that these results reflect the state of each API in January 2020, when testing took place. Performance would likely be better if we ran the same tests today since speech recognition technology, as an application of machine learning, tends to improve over time.

    Conclusion: AI speech-to-text accuracy is comparable to humans

    Each API achieved a level of accuracy and latency suitable for real-time captioning. The latency of Amazon’s API was a bit higher than that of IBM’s and Google’s engines, but the three are comparable when it comes to accuracy and cost. We also tested each engine for noise resilience (transcription accuracy in the presence of noise) and found that audio equipment quality, microphone placement, and other factors are essential for acceptable performance.

    What does all this mean in practical terms? These APIs are ready for use in live event scenarios – but how can organizations actually leverage them?

    Building your own solution around one of these APIs would require developing:

    • An automatic speech recognition edge agent to capture and stream audio data to the cloud
    • A digital signage platform and agent to receive, render, and display transcriptions
    • A Web portal or mobile application to accommodate users who are seated far from in-room monitors or who have visual impairments or vision loss

    And so on. The other, less burdensome option is to use an off-the-shelf dedicated automatic transcription device.


    Accurate, affordable, and automatic live transcription

    Epiphan LiveScrypt converts speech to text in real time for display on monitors and mobile devices during live events, improving accessibility and participant engagement affordably.


    Get the best of automatic transcription technology today

    Powered by Google’s advanced speech recognition technology, LiveScrypt features professional audio inputs (XLR, TRS) so you can capture crystal-clear audio that’s conducive to high AI speech-to-text accuracy. LiveScrypt also includes HDMI and SDI inputs to capture embedded audio, a built-in screen for configuration, and a QR code system for easy streaming, simplifying setup and making for fewer points of failure.


    Visit https://epiphan.com/products/livescrypt to learn more about how our dedicated automatic transcription device can help make your live events more accessible and engaging.
