A Comparison of Automatic Speech Recognition (ASR) Systems

Back in March 2016 I wrote Semi-automated podcast transcription about my interest in finding ways to make archives of podcast content more accessible. Please read that post for details of my motivations and goals.

Some 11 months later, in February 2017, I wrote Comparing Transcriptions describing how I was exploring measuring transcription accuracy. That turned out to be more tricky, and interesting, than I’d expected. Please read that post for details of the methods I’m using and what the WER (word error rate) score means.

Here, after another over-long gap, I’m returning to post the current results, and start thinking about next steps. One cause of the delay has been that whenever I returned to the topic there had been significant changes in at least one of the results, most recently when Google announced their enhanced models. In the end the delay turned out to be helpful.

The Scores

The table below shows the results of my tests on many automated speech recognition services, ordered by WER score (lower is better). I’ll note a major caveat up front: I only used a single audio file for these tests. It’s an almost two-hour interview in English between two North American males with no strong accents and good audio quality. I can’t be sure how the results would differ for female voices, more accented voices, lower audio quality etc. I plan to retest the top-tier services with at least one other file in due course.

You can’t beat a human, at least not yet. All the human services scored between 4 and 6. I described them in my previous post, so I won’t dwell on them here.

| Service | WER | Punctuation (. / , / ? / names) | Timing | Other Features | Approx Cost (not bulk) |
|---|---|---|---|---|---|
| Human (3PlayMedia) | 4.5 | 1261/1470/76/1064 | | | $3/min |
| Human (Voicebase) | 4.6 | 1090/1626/57/1056 | | | $1.5/min |
| Human (Scribie) | 5.1 | 923/1450/49/1153 | | | $0.75/min |
| Human (Volunteer) | 5.3 | 840/1748/60/1208 | | | Goodwill |
| Google Speech-to-Text (video model, not enhanced) | 10.7 | 792/421/29/1238 | Words | C, A, V | $0.048/min |
| Otter AI | 11.50 | 786/1166/35/1030 | Pgfs | E, S | Free up to 600 mins/month |
| Spext | 11.81 | 813/369/30/1263 | Lines | E | $0.16/min |
| Go-Transcribe | 12.1 | 979/0/0/922 | Pgfs | E | $0.22/min |
| SimonSays | 12.2 | 941/0/0/893 | Lines | E, S | $0.17/min |
| Trint | 12.3 | 968/0/0/894 | Lines | E | $0.33/min |
| Speechmatics | 12.3 | 955/0/0/929 | Words | S, C | $0.08/min |
| Sonix | 12.3 | 943/0/0/900 | Lines | D, S, E | $0.083/min + $15/month |
| Temi | 12.5 | 915/1329/51/862 | Pgfs | S, E | $0.10/min |
| TranscribeMe | 12.9 | 1203/0/63/836 | Lines | | $0.25/min |
| Scribie ASR | 12.9 | 970/1307/48/973 | None | E | Currently free |
| YouTube Captions | 15.0 | 0/0/0/1075 | Lines | S | Currently free |
| Voicebase | 16.6 | 116/0/0/1119 | Lines | E, V | $0.02/min |
| AWS Transcribe | 22.2 | 772/0/85/67 | Words | S, C, A, V | $0.02/min |
| Vocapia VoxSigma | 23.6 | 771/599/0/931 | Words | S, C | $0.02/min approx |
| IBM Watson | 25.2 | 11/0/0/896 | Words | C, A, V | $0.02/min |
| Dragon +vocabulary | 25.3 | 9/7/0/967 | None | | Free + €300 for app |
| Deepgram | 27.9 | 715/1262/52/443 | Pgfs | S, E | $0.0183 |
| SpokenData | 36.5 | 1457/0/0/680 | Words | S, E | $0.12/min |

  • WER: Word error rate (lower is better). A minimal sketch of the computation follows this list.
  • Punctuation: Number of sentences / commas / question marks / capital letters not at the start of a sentence (a rough proxy for proper nouns).
  • Timing: Approximate highest precision of the timing data: Words typically means a data format like JSON or XML with timing information for each word, Lines typically means a subtitle format like SRT, Pgfs (paragraphs) means lower precision, paragraph-level timing.
  • Other Features: E = online editor, S = speaker identification (diarisation), A = suggested alternatives, C = confidence score, V = custom vocabulary (not used in these tests).
  • Approx Cost: base cost, before any bulk discount, in USD.
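
For reference, here’s a minimal sketch of how a WER score can be computed: the word-level edit distance (substitutions, insertions and deletions) divided by the number of reference words, expressed as a percentage. It isn’t the exact script I used; see Comparing Transcriptions for the details of my method.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a percentage: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming (Levenshtein) edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)

print(round(wer("you can't beat a human at least not yet",
                "you can beat a human at least yet"), 1))  # 22.2 (1 sub + 1 del over 9 words)
```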

Note the clustering of WER scores. After the human services scoring from 4–6, the top-tier ASR services all score 10–16, with most around 12. The scores in the next tier are roughly double: 22–28. It seems likely that the top-tier systems are using more modern technology.

For my goals I prioritise these features:

  • Accuracy is a priority, naturally, so most systems in the top-tier would do.
  • A custom vocabulary would further improve accuracy.
  • Cost. Clearly $0.02/min is much more attractive than $0.33/min when there are hundreds of hours of archives to transcribe. (I’m ignoring bulk discounts for now.)
  • Word-level timing enables accurate linking to audio segments (see the sketch after this list) and makes it easier to compare and merge transcripts from multiple sources, such as taking punctuation from one transcript and applying it to another.
  • Good punctuation reduces the manual review effort required to polish the automated transcript into something pleasantly readable. Recognition of questions would also help with topic segmentation.
  • Speaker identification would also help identify questions and enable multiple ‘timelines’ to help resolve transcripts where there’s cross-talk.
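
As a small illustration of what word-level timing makes possible, here’s a sketch of linking a search hit straight to the right point in the audio. The structure of `words` is invented for illustration; it isn’t any particular service’s output format, though the JSON the services return carries equivalent information.

```python
# Hypothetical word-level transcript: each entry is a word with start/end times in seconds.
# The exact structure varies per service; this layout is just for illustration.
words = [
    {"word": "automatic",   "start": 62.40, "end": 62.85},
    {"word": "speech",      "start": 62.85, "end": 63.20},
    {"word": "recognition", "start": 63.20, "end": 63.90},
]

def find_phrase_start(words, phrase):
    """Return the start time (seconds) of the first occurrence of phrase, or None."""
    target = phrase.lower().split()
    tokens = [w["word"].lower() for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i]["start"]
    return None

print(find_phrase_start(words, "speech recognition"))  # 62.85 -> seek the player to here
```

The same word-level alignment is what would allow punctuation or corrections from one transcript to be mapped onto the words of another.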

Before Google released their updated Speech-to-Text service in April there wasn’t a clear winner for me. Now there is. Their new video premium model is significantly better than anything else I’ve tested.

I also tested their enhanced models a few weeks after I initially posted this. They didn’t help for my test file. I also tried setting interactionType and industryNaicsCodeOfAudio in the recognition metadata of the video model but that made the WER slightly worse. Perhaps they will improve over time.
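
For anyone wanting to try the same settings, a request along these lines exercises them via the Google Cloud Speech-to-Text Python client. Treat it as a sketch rather than my exact setup: the field names have shifted between client library versions, and the GCS path and NAICS code below are placeholders, not values from my tests.

```python
# Illustrative only: based on the Google Cloud Speech-to-Text Python client.
# Field names vary between client versions; the GCS path and NAICS code are placeholders.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    language_code="en-US",
    model="video",                      # the premium video model
    use_enhanced=True,                  # enhanced variant (didn't help my test file)
    enable_word_time_offsets=True,      # word-level timing
    enable_automatic_punctuation=True,
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.DISCUSSION,
        industry_naics_code_of_audio=519130,   # example NAICS code
    ),
)

# Long recordings have to be read from Google Cloud Storage.
audio = speech.RecognitionAudio(uri="gs://example-bucket/podcast-episode.flac")

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)

for result in response.results:
    alternative = result.alternatives[0]
    start = alternative.words[0].start_time.total_seconds()
    print(f"[{start:8.2f}s] {alternative.transcript}")
```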

Punctuation is clearly subjective but both Temi and Scribie get much closer than Google to the number of question marks and commas used by the human transcribers. Google did very well on capital letters though (a rough proxy for proper nouns).

I think we’ll see a growing ecosystem of tools and services using Google Speech-to-Text service as a backend. The Descript app is an interesting example.

Differential Analysis

While working on Comparing Transcriptions I’d realized that comparing transcripts from multiple services is a good way to find errors because they tend to make different mistakes.

So for this post I also compared most of the top-tier services against one another, i.e. using the transcript from one as the ‘ground truth’ for scoring others. A higher WER score in this test is good. It means the services are making different mistakes and those differences would highlight errors.

Google, Otter AI, Temi, Voicebase, Scribie, and TranscribeMe all scored a high WER, over 10, against all the others. Go-Transcribe vs Speechmatics had a WER of 6.1. SimonSays had a WER of 5.2 against Sonix, Trint, and Speechmatics. Trint, Sonix, and Speechmatics have very little difference between the transcripts, a WER of just 1.4. That suggests those three services are using very similar models and training data.

What Next?

My primary goal is to get the transcripts available and searchable, so the next phase would be developing a simple process to transcribe each podcast and convert the result into web pages. That much seems straightforward using the Google API. Then there’s working with the podcast host to integrate with their website, style, menus etc.
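
To make the “convert the result into web pages” step a little more concrete, here’s a rough sketch of the kind of output I have in mind: each paragraph carries its start time so a click can seek the audio player. The segment structure is invented for illustration, not a particular service’s output.

```python
# Sketch of turning timed transcript segments into a simple HTML page.
# The input structure is invented for illustration, not a service's real output.
import html

segments = [
    {"start": 0.0, "text": "Welcome to the show."},
    {"start": 4.2, "text": "Today we're talking about speech recognition."},
]

def to_html(segments, audio_url):
    parts = [f'<audio id="player" controls src="{html.escape(audio_url)}"></audio>']
    for seg in segments:
        # data-start lets the inline handler seek the audio element on click.
        parts.append(
            f'<p class="segment" data-start="{seg["start"]:.2f}" '
            f'onclick="player.currentTime=parseFloat(this.dataset.start);player.play()">'
            f'{html.escape(seg["text"])}</p>'
        )
    return "\n".join(parts)

print(to_html(segments, "episode-001.mp3"))
```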

After that the steps are more fuzzy. I’ll be crossing the river by feeling the stones…

The automated transcripts will naturally have errors that people notice (and more that they won’t). To improve the quality it’s important to make it very easy for them to contribute corrections. Being able to listen to the corresponding section of audio would be a great help. All that will require a web-based user interface backed by a service and a suitable data model.

The suggested corrections will need reviewing and merging. That will require its own low-friction workflow. I have a vague notion of using GitHub for this.

Generating transcripts from at least one other service would provide a way to highlight possible errors, in both words and punctuation. Those highlights would be useful for readers and also encourage the contribution of corrections. Otter AI, Speechmatics and Voicebase are attractive low-cost options for these extra transcriptions, as are any contributed by volunteers. This kind of multi-transcription functionality has significant implications for the data model.
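
One cheap way to find those candidate errors is a word-level diff between two transcripts: wherever they disagree is a span worth highlighting for review. A minimal sketch using Python’s difflib, not a design for the real data model:

```python
# Sketch: flag spans where two automated transcripts disagree, as candidate errors.
import difflib

def disagreements(transcript_a: str, transcript_b: str):
    """Yield (span_a, span_b) pairs where the two word sequences differ."""
    a, b = transcript_a.split(), transcript_b.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield " ".join(a[i1:i2]) or "(missing)", " ".join(b[j1:j2]) or "(missing)"

for span_a, span_b in disagreements("the cat sat on the mat today",
                                    "the cap sat on the mat"):
    print(f"A: {span_a!r}  vs  B: {span_b!r}")
# A: 'cat'  vs  B: 'cap'
# A: 'today'  vs  B: '(missing)'
```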

I’d like to directly support translations of the transcriptions. The original transcription is a moving target as corrections are submitted over time, so the translations would need to track corrections applied to the original transcription since the translation was created. Translators are also very likely to notice errors in the original, especially if they’re working from the audio.

Before getting into any design or development work, beyond the basic transcriptions, I’d want to do another round of due diligence research, looking for what services and open source projects might be useful components or form good foundations. Amara springs to mind. If you know of any existing projects or services that may be relevant please add a comment or let me know in some other way.

I’m not sure when, or even if, I’ll have any further updates on this hobby project. If you’re interested in helping out feel free to email me.

I hope you’ve found my rambling explorations interesting.

Updates:

  • 25th May 2018: Updated SimonSays.ai with much improved score
  • 10th June 2018: Updated notes about Google enhanced model (not helping WER score).
  • 8th September 2018: Added Otter AI, prompted by a note in a blog post by Descript comparing ASR systems.
  • 10th September 2018: Emphasised that I only used a single audio file for these tests. Noted that Otter.ai is free up to 600 mins/month.
  • 14th September 2018: Added Spext.
  • 14th September 2018: Discussion about this post on Hacker News.

14 thoughts on “A Comparison of Automatic Speech Recognition (ASR) Systems”

    • Thanks for the reminder Bo. I remember taking a look at it but not why I didn’t pursue it. Perhaps because it wasn’t sufficiently easy to use. The REST API supports audio streams of only up to 15 seconds and there isn’t a client library in a language I’m familiar with. If someone is willing to do the legwork I’d be happy to supply a test file and analyse the results they get back. Or if someone knows of a commercial ASR service that uses the Bing Speech to Text API then let me know and I’ll test it that way.

      • I don’t think in this comparison they were using Kaldi in an optimal way. The thesis seems to just mention using an acoustic model from Kaldi. Kaldi is a complex piece of software, but having spoken with a lot of people involved in ASR, I believe that its performance can be very close to state of the art. While the cloud services are undoubtedly easier to use than Kaldi, it would be pretty interesting to see what WER someone who understands Kaldi well could get using Kaldi on your data.

  1. Hi Tim, fantastic rambling explorations! I was just about to do something similar when I found your hobby project here. Thank you. If you would like some help to keep it current, feel free to contact me as I’d be happy to help out.

  2. Hi Tim – Thanks for doing all this work. Very thorough and clearly written. The only challenge with your work is that it is dependent on one file. Testing multiple files of different quality, length, and speakers will give you different results. Because every customer collects different audio, we always suggest that customers try before they buy. That way you can compare your own audio across multiple services.

    For your next bit of work to make podcasts available to the masses, you may want to look at our publishing tool which is included with Sonix: https://sonix.ai/resources/sonix-built-worlds-first-seo-friendly-mediaplayer/

    Once you have an automated transcript in Sonix, it is as simple as copying and pasting code to your website. Then you have a full player with the transcript attached, making it SEO-friendly, i.e. Google and other search engines can index every word.

    Here’s a quick overview of how it works: https://sonix.ai/resources/sonix-tutorial-sonix-for-radio-podcasters/ Just click on the list icon to quickly navigate to key parts.

    Thanks again for your work. We need more people like you studying our space.

    Jamie

    • I agree that using one test file limits the scope and value of the test. It was a natural consequence of the limited time I have available. I’ve updated the post to make that point bold to reduce the chance anyone would miss it. I’ve also added a note that I plan to retest the top-tier systems with at least one other file with a different speaker and audio setup.

      Regarding publishing, I’m wary of hosted solutions in general. They have many advantages and may suit many people, yet they’re also limiting in some ways and make the content’s availability dependent on a third party. Your service, like others, presents the transcript as if it’s perfect. We know that it’s unlikely to be perfect until it’s been carefully reviewed and corrected. For my use-case I want to encourage the readers to review the transcript and contribute changes. Crowd-sourcing edits presents lots of challenges.

  3. I’m tremendously appreciative of this study Tim! I was digging all over to find comparisons and make a decision based on a cost/accuracy ratio. I actually ended up going with Descript all from catching that little comment at the end of your post (which apparently you’ve added just a few days ago!) From the little experience I’ve had testing different files with some of the apps listed here, Descript seems to offer some of the best accuracy, if not the best, since they’re using Google’s “video” speech-to-text as their backend. The price is still very affordable compared to others. Again, thanks for the leads and all this leg work.


  4. Tim, an update on the VoiceBase engine: it looks like you were using version 1 of our solution via our Web App, which was readily available on our website. This version is for small use cases that don’t want to, or can’t, program to our normal production API, which is actually version 3 of our service. Version 3 has more accuracy, better pricing, custom vocabulary and an upgraded punctuation model. Would be happy to arrange a test for you if you’d like to do it. tony@voicebase.com.

    • Hello Tony. I don’t recall now if I used the web app or the API. I can see that I registered for an API token in 2016 and I have two test files. The first is dated Feb 2018 and scored a WER of 15.93. The second is dated May 2016 and scored a WER of 16.62. I’ve just used the v3 API to create a new test file today. That scored a WER of 16.94. So the results for VoiceBase, on this test, are getting slightly worse over time. (You’re not alone though, other services have also gotten slightly worse over time. For example Trint’s WER changed from 12.0 to 12.3 between Feb 2017 and Feb 2018.)
