A Comparison of Automatic Speech Recognition (ASR) Systems

Back in March 2016 I wrote “Semi-automated podcast transcription” about my interest in finding ways to make archives of podcast content more accessible. Please read that post for details of my motivations and goals.

Some 11 months later, in February 2017, I wrote “Comparing Transcriptions”, describing how I was exploring ways to measure transcription accuracy. That turned out to be trickier, and more interesting, than I’d expected. Please read that post for details of the methods I’m using and what the WER (word error rate) score means.

Here, after another over-long gap, I’m returning to post the current results, and start thinking about next steps. One cause of the delay has been that whenever I returned to the topic there had been significant changes in at least one of the results, most recently when Google announced their enhanced models. In the end the delay turned out to be helpful.

The Scores

The table below shows the results of my tests on many automated speech recognition services, ordered by WER score (lower is better). I’ll note a major caveat up front: I only used a single audio file for these tests, an almost two-hour interview in English between two North American males with no strong accents and good audio quality. I can’t be sure how the results would differ for female voices, more accented voices, lower audio quality etc. I plan to retest the top-tier services with at least one other file in due course.

You can’t beat a human, at least not yet. All the human services scored between 4 and 6. I described them in my previous post, so I won’t dwell on them here.

| Service | WER | Punctuation ( . / , / ? / names ) | Timing | Other Features | Approx Cost (not bulk) |
|---|---|---|---|---|---|
| Human (3PlayMedia) | 4.5 | 1261/1470/76/1064 | | | $3/min |
| Human (Voicebase) | 4.6 | 1090/1626/57/1056 | | | $1.5/min |
| Human (Scribie) | 5.1 | 923/1450/49/1153 | | | $0.75/min |
| Human (Volunteer) | 5.3 | 840/1748/60/1208 | | | Goodwill |
| Google Speech-to-Text (video model, not enhanced) | 10.7 | 792/421/29/1238 | Words | C, A, V | $0.048/min |
| Otter AI | 11.50 | 786/1166/35/1030 | Pgfs | E, S | Free up to 600 mins/month |
| Spext | 11.81 | 813/369/30/1263 | Lines | E | $0.16/min |
| Go-Transcribe | 12.1 | 979/0/0/922 | Pgfs | E | $0.22/min |
| SimonSays | 12.2 | 941/0/0/893 | Lines | E, S | $0.17/min |
| Trint | 12.3 | 968/0/0/894 | Lines | E | $0.33/min |
| Speechmatics | 12.3 | 955/0/0/929 | Words | S, C | $0.08/min |
| Sonix | 12.3 | 943/0/0/900 | Lines | D, S, E | $0.083/min+$15/mon |
| Temi | 12.5 | 915/1329/51/862 | Pgfs | S, E | $0.10/min |
| TranscribeMe | 12.9 | 1203/0/63/836 | Lines | | $0.25/min |
| Scribie ASR | 12.9 | 970/1307/48/973 | None | E | Currently free |
| YouTube Captions | 15.0 | 0/0/0/1075 | Lines | S | Currently free |
| Voicebase | 16.6 | 116/0/0/1119 | Lines | E, V | $0.02/min |
| AWS Transcribe | 22.2 | 772/0/85/67 | Words | S, C, A, V | $0.02/min |
| IBM Watson | 25.2 | 11/0/0/896 | Words | C, A, V | $0.02/min |
| Dragon +vocabulary | 25.3 | 9/7/0/967 | None | | Free + €300 for app |
| Deepgram | 27.9 | 715/1262/52/443 | Pgfs | S, E | $0.0183 |
| SpokenData | 36.5 | 1457/0/0/680 | Words | S, E | $0.12/min |

  • WER: Word error rate (lower is better).
  • Punctuation: Number of sentences / commas / question marks / capital letters not at the start of a sentence (a rough proxy for proper nouns).
  • Timing: Approximate highest precision timing: Words typically means a data format like JSON or XML with timing information for each word, Lines typically means a subtitle format like SRT, Pgfs (paragraphs) means some lower precision.
  • Other Features: E=online editor, S=speaker identification (diarisation), A=suggested alternatives, C=confidence score, V=custom vocabulary (not used in these tests).
  • Approx Cost: base cost, before any bulk discount, in USD.
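
For readers who haven’t seen the earlier post, here is a minimal sketch of how a WER score can be computed: the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript. It assumes the scores in the table are percentages, and the normalisation here (lowercasing and splitting on whitespace) is a simplification of my own, not necessarily the one used for the results above. The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage of the number of reference words."""
    # Simplified normalisation (an assumption): lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat",
                            "the cat sat on a mat"), 1))  # 16.7
```

On this scale, a score of 12 means roughly one word in eight differs from the reference transcript.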

Note the clustering of WER scores. After the human services scoring from 4–6, the top-tier ASR services all score 10–16, with most around 12. The scores in the next tier are roughly double: 22–28. It seems likely that the top-tier systems are using more modern technology.

For my goals I prioritise these features:

  • Accuracy is a priority, naturally, so most systems in the top-tier would do.
  • A custom vocabulary would further improve accuracy.
  • Cost. Clearly $0.02/min is much more attractive than $0.33/min when there are hundreds of hours of archives to transcribe. (I’m ignoring bulk discounts for now.)
  • Word level timing enables accurate linking to audio segments and helps enable comparison/merging of transcripts from multiple sources (such as taking punctuation from one transcript and applying it to another).
  • Good punctuation reduces the manual review effort required to polish the automated transcript into something pleasantly readable. Recognition of questions would also help with topic segmentation.
  • Speaker identification would also help identify questions and enable multiple ‘timelines’ to help resolve transcripts where there’s cross-talk.
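
To put the cost point in concrete terms, here is a small sketch of the arithmetic. The per-minute prices come from the table above; the figure of 200 hours is only a placeholder for “hundreds of hours”, and the function name is hypothetical.

```python
def transcription_cost(hours: float, usd_per_min: float) -> float:
    """Base cost in USD for transcribing `hours` of audio, ignoring bulk discounts."""
    return round(hours * 60 * usd_per_min, 2)


ARCHIVE_HOURS = 200  # placeholder, not a measured figure

for service, price in [("Voicebase", 0.02), ("Google", 0.048), ("Trint", 0.33)]:
    print(service, transcription_cost(ARCHIVE_HOURS, price))
# Voicebase 240.0
# Google 576.0
# Trint 3960.0
```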

Before Google released their updated Speech-to-Text service in April there wasn’t a clear winner for me. Now there is. Their new video premium model is significantly better than anything else I’ve tested.

I also tested their enhanced models a few weeks after I initially posted this. They didn’t help for my test file. I also tried setting interactionType and industryNaicsCodeOfAudio in the recognition metadata of the video model, but that made the WER slightly worse. Perhaps they will improve over time.
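
For anyone wanting to repeat the experiment, here is a rough sketch of the kind of request body involved, built as a plain Python dictionary. The field names follow the Speech-to-Text REST API as I understand it and should be checked against the current documentation. The bucket URI, the interaction type and the NAICS code are placeholder values, and the function name is hypothetical.

```python
import json


def build_recognize_request(gcs_uri: str, with_metadata: bool = False) -> dict:
    """Build a request body for a long-running recognition call."""
    config = {
        "languageCode": "en-US",
        "model": "video",                    # the model tested in this post
        "useEnhanced": False,                # enhanced models didn't help here
        "enableWordTimeOffsets": True,       # word-level timing
        "enableAutomaticPunctuation": True,
    }
    if with_metadata:
        # Placeholder values; these are the two fields mentioned above.
        config["metadata"] = {
            "interactionType": "DISCUSSION",
            "industryNaicsCodeOfAudio": 519130,
        }
    return {"config": config, "audio": {"uri": gcs_uri}}


body = build_recognize_request("gs://example-bucket/interview.flac",
                               with_metadata=True)
print(json.dumps(body, indent=2))
```

Building the body separately from the HTTP call makes it easy to run the same file with and without the metadata and compare the resulting WER.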

Punctuation is clearly subjective but both Temi and Scribie get much closer than Google to the number of question marks and commas used by the human transcribers. Google did very well on capital letters though (a rough proxy for proper nouns).
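
The punctuation counts in the table can be approximated with something like the sketch below. It is an illustration under my own simplifying assumptions, not the exact script used for the table: it counts full stops rather than true sentences, it will count the pronoun “I” as a capitalised word, and abbreviations such as “Dr.” will confuse the sentence-start detection. The function name is hypothetical.

```python
def punctuation_profile(text: str) -> tuple:
    """Return (full stops, commas, question marks, mid-sentence capitals)."""
    periods = text.count(".")
    commas = text.count(",")
    questions = text.count("?")

    mid_caps = 0
    sentence_start = True
    for word in text.split():
        # Ignore opening quotes/brackets when checking the first letter.
        core = word.lstrip("\"'“‘(")
        if core[:1].isupper() and not sentence_start:
            mid_caps += 1
        # The next word starts a sentence if this one ends with . ? or !
        sentence_start = word.rstrip("\"'”’)").endswith((".", "?", "!"))
    return periods, commas, questions, mid_caps


sample = "Hello Tim, how are you? I went to Paris. It was fine, really."
print(punctuation_profile(sample))  # (2, 2, 1, 2)
```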

I think we’ll see a growing ecosystem of tools and services using the Google Speech-to-Text service as a backend. The Descript app is an interesting example.

Differential Analysis

While working on “Comparing Transcriptions” I’d realized that comparing transcripts from multiple services is a good way to find errors because they tend to make different mistakes.

So for this post I also compared most of the top-tier services against one another, i.e. using the transcript from one as the ‘ground truth’ for scoring others. A higher WER score in this test is good. It means the services are making different mistakes and those differences would highlight errors.

Google, Otter AI, Temi, Voicebase, Scribie, and TranscribeMe all scored a high WER, over 10, against all the others. Go-Transcribe vs Speechmatics had a WER of 6.1. SimonSays had a WER of 5.2 against Sonix, Trint, and Speechmatics. Trint, Sonix, and Speechmatics have very little difference between the transcripts, a WER of just 1.4. That suggests those three services are using very similar models and training data.
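
As an illustration of how differences between two transcripts can point at likely errors, here is a minimal sketch using Python’s standard difflib module. It is not the tooling used for the scores above, and the alignment difflib produces is not guaranteed to be the minimal edit distance that a WER calculation uses, but it is enough to list the places where two services disagree. The function name and sample sentences are made up.

```python
from difflib import SequenceMatcher


def disagreements(transcript_a: str, transcript_b: str) -> list:
    """List (word index in A, words from A, words from B) where the two differ."""
    a = transcript_a.lower().split()
    b = transcript_b.lower().split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((a0, " ".join(a[a0:a1]), " ".join(b[b0:b1])))
    return spans


service_a = "we met at the recognition conference in boston"
service_b = "we met at the wreck ignition conference in boston"
print(disagreements(service_a, service_b))
# [(4, 'recognition', 'wreck ignition')]
```

Each span is a candidate for human review. Two services built on very similar models will produce few spans, which is why a low WER between services is less useful here.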

What Next?

My primary goal is to get the transcripts available and searchable, so the next phase would be developing a simple process to transcribe each podcast and convert the result into web pages. That much seems straightforward using the Google API. Then there’s working with the podcast host to integrate with their website, style, menus etc.
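
A first pass at the “convert the result into web pages” step might look like the sketch below, which keeps the word-level timing in the HTML so that a player script could later jump to the matching audio. The input shape (a list of dictionaries with `word` and `start` keys) and the `data-start` attribute are my own assumptions for illustration, not the format returned by any particular service.

```python
from html import escape


def words_to_html(words: list) -> str:
    """Render timed words as one paragraph of <span> elements.

    `words` is a list of dicts with hypothetical keys:
    'word' (text) and 'start' (seconds from the start of the audio).
    """
    spans = [
        '<span data-start="{:.2f}">{}</span>'.format(w["start"], escape(w["word"]))
        for w in words
    ]
    return "<p>" + " ".join(spans) + "</p>"


print(words_to_html([
    {"word": "Hello", "start": 0.0},
    {"word": "R&D", "start": 0.5},
]))
# <p><span data-start="0.00">Hello</span> <span data-start="0.50">R&amp;D</span></p>
```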

After that the steps are more fuzzy. I’ll be crossing the river by feeling the stones…

The automated transcripts will naturally have errors that people notice (and more that they won’t). To improve the quality it’s important to make it very easy for them to contribute corrections. Being able to listen to the corresponding section of audio would be a great help. All that will require a web-based user interface backed by a service and a suitable data model.

The suggested corrections will need reviewing and merging. That will require its own low-friction workflow. I have a vague notion of using GitHub for this.

Generating transcripts from at least one other service would provide a way to highlight possible errors, in both words and punctuation. Those highlights would be useful for readers and also encourage the contribution of corrections. Otter AI, Speechmatics and Voicebase are attractive low-cost options for these extra transcriptions, as are any contributed by volunteers. This kind of multi-transcription functionality has significant implications for the data model.

I’d like to directly support translations of the transcriptions. The original transcription is a moving target as corrections are submitted over time, so the translations would need to track corrections applied to the original transcription since the translation was created. Translators are also very likely to notice errors in the original, especially if they’re working from the audio.
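
One possible way to model this, sketched very roughly: give each segment of the original a revision number that is bumped when a correction is accepted, and have each translated segment record the revision it was translated from. Segments whose original has moved on can then be flagged for the translator. All names here are hypothetical, and a real data model would need much more (authors, timestamps, review state).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    revision: int = 1  # bumped each time a correction is accepted

    def correct(self, new_text: str) -> None:
        self.text = new_text
        self.revision += 1


@dataclass
class TranslatedSegment:
    text: str
    source_revision: int  # revision of the original this was translated from


def stale_segments(original: dict, translation: dict) -> list:
    """Return ids of segments that are untranslated or translated from an older revision."""
    return [
        seg_id
        for seg_id, seg in original.items()
        if seg_id not in translation
        or translation[seg_id].source_revision < seg.revision
    ]


original = {"s1": Segment("hello world"), "s2": Segment("good buy")}
translation = {
    "s1": TranslatedSegment("hola mundo", source_revision=1),
    "s2": TranslatedSegment("buena compra", source_revision=1),
}
original["s2"].correct("goodbye")
print(stale_segments(original, translation))  # ['s2']
```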

Before getting into any design or development work, beyond the basic transcriptions, I’d want to do another round of due-diligence research, looking for what services and open source projects might be useful components or form good foundations. Amara springs to mind. If you know of any existing projects or services that may be relevant please add a comment or let me know in some other way.

I’m not sure when, or even if, I’ll have any further updates on this hobby project. If you’re interested in helping out feel free to email me.

I hope you’ve found my rambling explorations interesting.

Updates:

  • 25th May 2018: Updated SimonSays.ai with much improved score.
  • 10th June 2018: Updated notes about Google enhanced model (not helping WER score).
  • 8th September 2018: Added Otter AI, prompted by a note in a blog post by Descript comparing ASR systems.
  • 10th September 2018: Emphasised that I only used a single audio file for these tests. Noted that Otter.ai is free up to 600 mins/month.
  • 14th September 2018: Added Spext.
  • 14th September 2018: Discussion about this post on Hacker News.
  • 15th November 2018: Removed results for Vocapia at their request since they “do not consider that the testing was done in a scientifically rigorous manner”.

17 thoughts on “A Comparison of Automatic Speech Recognition (ASR) Systems”

    • Thanks for the reminder Bo. I remember taking a look at it but not why I didn’t pursue it. Perhaps because it wasn’t sufficiently easy to use. The REST API supports audio streams only up to 15 seconds and there isn’t a client library in a language I’m familiar with. If someone is willing to do the legwork I’d be happy to supply a test file and analyse the results they get back. Or if someone knows of a commercial ASR service that uses the Bing Speech to Text API then let me know and I’ll test it that way.

      • I don’t think in this comparison they were using Kaldi in an optimal way. The thesis seems to just mention using an acoustic model from Kaldi. Kaldi is a complex piece of software, but having spoken with a lot of people involved in ASR, I believe that its performance can be very close to state of the art. While the cloud services are undoubtedly easier to use than Kaldi, it would be pretty interesting to see what WER someone who understands Kaldi well could get using Kaldi on your data.

  1. Hi Tim, fantastic rambling explorations! I was just about to do something similar when I found your hobby project here. Thank you. If you would like some help to keep it current, feel free to contact me as I’d be happy to help out.

  2. Hi Tim – Thanks for doing all this work. Very thorough and clearly written. The only challenge with your work is that it is dependent on one file. Testing multiple files of different quality, length, speakers will give you different results. Because every customer collects different audio, we always suggest that customers try before they buy. That way you can compare your own audio across multiple services.

    For your next bit of work to make podcasts available to the masses, you may want to look at our publishing tool which is included with Sonix: https://sonix.ai/resources/sonix-built-worlds-first-seo-friendly-mediaplayer/

    Once you have an automated transcript in Sonix, it is as simple as copying and pasting code to your website. Then you have a full player with the transcript attached making it SEO-friendly–ie. Google and other search engines can index every word.

    Here’s a quick overview of how it works: https://sonix.ai/resources/sonix-tutorial-sonix-for-radio-podcasters/ Just click on the list icon to quickly navigate to key parts.

    Thanks again for your work. We need more people like you studying our space.

    Jamie

    • I agree that using one test file limits the scope and value of the test. It was a natural consequence of the limited time I have available. I’ve updated the post to make that point bold to reduce the chance anyone would miss it. I’ve also added a note that I plan to retest the top-tier systems with at least one other file with a different speaker and audio setup.

      Regarding publishing, I’m wary of hosted solutions in general. They have many advantages and may suit many people, yet they’re also limiting in some ways and make the content availability dependent on a third-party. Your service, like others, presents the transcript as if it’s perfect. We know that it’s unlikely to be perfect until it’s been carefully reviewed and corrected. For my use-case I want to encourage the readers to review the transcript and contribute changes. Crowd-sourcing edits presents lots of challenges.

      • Hi Tim – thanks for your response and thank you for highlighting the fact that this comparison is only based on one file. As always, we recommend people test out the services first. The accuracy is highly dependent on the quality and composition of the file that is uploaded. Sonix has been independently reviewed as the most accurate transcription service.

        I’m not sure I follow the comment about content being dependent on a third-party. Can you help me understand what you mean by that? With Sonix you can share a transcript with as many people as you want and give them editing permissions. We also store version history so it’s easy to revert back to previous versions. Is that along the lines of what you are interested in?

        • Hi Jamie. I plan to run another evaluation of the top-tier services using a short section, say 20 mins, of multiple interviews from the archives, selected to have a mix of speakers, accents, and audio quality. It’ll be interesting to see how the services compare in that test.

          Re dependency on a third-party, yes Sonix and some others provide very attractive functionality for collaborative editing and publishing. I’m sure that would suit most people. I was simply noting that there are, alongside the many benefits, some risks and limitations in adopting the use of a third-party like Sonix. If the Sonix service is down for maintenance, however rare that might be, then my users lose access. More significantly, I’m limited to whatever functionality Sonix offers, however rich that might be, and can’t provide my own features (such as other OAuth providers, just as an example).

          In practice I’d guess, at this early stage, that I might end up adopting a system where a provider like Sonix handles the collaborative editing and the transcript is published elsewhere. I’m envisaging an interactive transcript, like those on TED.com, with extra features for exploring topics, clipping and threading clips together, and of course, a low-friction way to make corrections.

  3. I’m tremendously appreciative of this study Tim! I was digging all over to find comparisons and make a decision based on a cost/accuracy ratio. I actually ended up going with Descript all from catching that little comment at the end of your post (which apparently you’ve added just a few days ago!) From the little experience I’ve had testing different files with some of the apps listed here, Descript seems to offer one of, if not the best accuracy since they’re using Google’s “video” speech-to-text as their backend. The price is still very affordable compared to others. Again, thanks for the leads and all this leg work.

  4. Pingback: New top story on Hacker News: A Comparison of Automatic Speech Recognition (ASR) Systems – Latest news

  5. Tim, an update on the VoiceBase engine: it looks like you were using version 1 of our solution via our Web App, which was readily available on our website. This version is for small use cases that don’t want or can’t program to our normal production API which is actually version 3 of our service. Version 3 has more accuracy, better pricing, custom vocabulary and an upgraded punctuation model. Would be happy to arrange a test for you if you’d like to do it. tony@voicebase.com.

    • Hello Tony. I don’t recall now if I used the web app or the API. I can see that I registered for an API token in 2016 and I have two test files. The first is dated Feb 2018 and scored a WER of 15.93. The second is dated May 2016 and scored a WER of 16.62. I’ve just used the v3 API to create a new test file today. That scored a WER of 16.94. So the results for VoiceBase, on this test, are getting slightly worse over time. (You’re not alone though, other services have also gotten slightly worse over time. For example Trint’s WER changed from 12.0 to 12.3 between Feb 2017 and Feb 2018.)

  6. Hi Tim – Knowing pretty well the ASR technology and its evaluation, I
    am a bit puzzled by the numbers you are presenting. Some of the
    differences between systems seem too large to be realistic. For
    example, the WER of the IBM service is 2.5 times the WER of the Google
    service, I don’t think there could be such a difference between them
    since they all use basically the same underlying technology.

    Unless you are only interested in transcribing the specific data you
    used for your test, to better compare the different services you
    should be using a data set with at least 3 to 5 hours of speech and
    with at least 20 speakers (more if the data is not uniformly balanced
    over the speakers).

    Also it is highly desirable to use test data which postdates the date
    that the system’s models were built. This means the test data (or some
    closely related data) should not have already been published on the
    web to be sure the data has not been used to train the system’s
    language models (the web is an important source of texts for language
    model training). If the transcript of the audio data has been observed
    by only some of the systems, this could easily explain the ratio of 2
    in word error rate.

    I know that you do not claim to report scientifically sound results,
    but it would be much more meaningful if a well-established protocol
    was followed (cf. NIST ASR evaluations).

    Can you please provide some information about the data you used for
    your test? Was the data published on the internet and, if so,
    was it published before or after you performed your tests?
