In my previous post I evaluated a number of Automatic Speech Recognition systems. That evaluation was useful but limited in an important way: it only used a single good quality audio file with a single pair of speakers (who both happened to be males with clear North American accents). Consequently there was no evaluation of performance across a variety of accents and varying audio quality etc.
To address that limitation I’ve tested 14 ASR systems with 12 different audio files, covering a range of accents and audio quality. This post presents the results.
For this evaluation I picked a number of interviews, spread over a range of years with a mix of accents and audio qualities, and used a 10 minute section of each one. Below I’ve listed some details of the audio files. The label is the identifier used for each audio file in the results table; its first two digits are the year of the recording.
| Label | MP3 Attributes (all 16-bit) | Interviewees |
| F10.A41 | 48 kbps, 44.1 kHz, Joint Stereo | Female, Irish accent |
| F11.A97 | 96 kbps, 44.1 kHz, Mono | Male, Caribbean accent |
| F13.B52 | 64 kbps, 44.1 kHz, Joint Stereo | Female, British accent |
| F14.B18 | 96 kbps, 44.1 kHz, Mono | Female, North American accent |
| F14.C42 | 96 kbps, 44.1 kHz, Mono | Male, North American accent |
| F15.C96 | 96 kbps, 44.1 kHz, Mono | Male, North American accent |
| F16.D64 | 64 kbps, 48 kHz, Mono | Male, Indian accent |
| F17.D83 | 64 kbps, 48 kHz, Mono | Male, North American accent |
| F17.E03 | 64 kbps, 48 kHz, Mono | Female, North American accent |
| F18.E82 | 128 kbps, 44.1 kHz, Joint Stereo | One male, two female (crosstalk, clipping) |
| F18.E83 | 256 kbps, 48 kHz, Joint Stereo | Male, French accent |
| F18.E84 | 256 kbps, 48 kHz, Joint Stereo | Male, North American accent |
I used roughly the same methodology as before. I purchased verbatim transcripts, made and checked by humans, from three services: Rev, Scribie, and Cielo24. I compared the transcripts and wherever they differed I listened to the audio and decided on the ‘ground truth’ to use for the evaluation.
I want to take a moment to give credit to Rev for great service. They cost $1/min yet delivered all the transcripts within 4 hours and had the lowest WER score of 3.8, compared with 4.2 for Scribie ($1/min) and 5.5 for Cielo24’s top “Best+” service ($2/min).
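For readers unfamiliar with the metric, WER (word error rate) is the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the one being scored, divided by the number of words in the reference. Here is a minimal sketch of that calculation. The `tokenize` normalisation (lowercasing, stripping punctuation) is my own simplifying assumption for illustration, not necessarily the normalisation behind the scores in this post.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into words, dropping punctuation (an assumed normalisation)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a percentage: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For example, `wer("hello world", "hello there world")` counts one inserted word against a two-word reference, giving 50.0.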
For Microsoft I had to convert the files to WAV format (16-bit mono 16kHz) because that’s the only format their SDK supports. Similarly for Google I converted the files to FLAC (16-bit mono 16kHz). Both are lossless formats, so the conversion adds no further compression artifacts, although resampling to 16kHz mono does discard some of the original signal. All the other services accepted the original MP3 format.
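One way to do this kind of conversion is with ffmpeg. The sketch below only builds the command line; the choice of ffmpeg and the helper name `ffmpeg_command` are assumptions for illustration, since any tool that can produce 16-bit mono 16kHz WAV or FLAC would do.

```python
from pathlib import Path


def ffmpeg_command(src: str, fmt: str) -> list[str]:
    """Build an ffmpeg command converting src to 16-bit, mono, 16 kHz WAV or FLAC."""
    if fmt not in ("wav", "flac"):
        raise ValueError(f"unsupported format: {fmt}")
    dst = str(Path(src).with_suffix(f".{fmt}"))
    # -ac 1 downmixes to mono, -ar 16000 resamples to 16 kHz.
    cmd = ["ffmpeg", "-i", src, "-ac", "1", "-ar", "16000"]
    if fmt == "wav":
        cmd += ["-c:a", "pcm_s16le"]  # 16-bit little-endian PCM
    else:
        cmd += ["-c:a", "flac", "-sample_fmt", "s16"]  # 16-bit FLAC
    return cmd + [dst]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine with ffmpeg installed.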
The table below presents the results. The ‘Humans’ row of the table shows the median WER score for the three human transcripts. The service rows are ordered by the median of their WER scores across all 12 files. Each cell is color coded according to the degree to which the WER score is better (lower, deeper green) or worse (higher, deeper red) than the median of the ASR results for that file (shown in a middle row).
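| Service | Median | F10.A41 | F11.A97 | F13.B52 | F14.B18 | F14.C42 | F15.C96 | F16.D64 | F17.D83 | F17.E03 | F18.E82 | F18.E83 | F18.E84 |
| Google Enh. Video | 11.3 | 21 | 9 | 10 | 8 | 11 | 8 | 12 | 9 | 12 | 22 | 14 | 13 |
| Median of ASR results | 13.8 | 23 | 12 | 12 | 10 | 14 | 10 | 18 | 13 | 15 | 22 | 14 | 14 |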
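To make the color coding concrete, here is a small sketch that computes how far each Google score sits from the per-file ASR median, using the two rows shown above. It assumes the twelve per-file columns follow the same order as the audio file list earlier in the post. The `shade` helper is a hypothetical stand-in for the green/red shading. The cell values are rounded, so a median recomputed from them need not match the table's first column exactly.

```python
from statistics import median

# Assumed column order: same as the audio file list earlier in the post.
files = ["F10.A41", "F11.A97", "F13.B52", "F14.B18", "F14.C42", "F15.C96",
         "F16.D64", "F17.D83", "F17.E03", "F18.E82", "F18.E83", "F18.E84"]
google = [21, 9, 10, 8, 11, 8, 12, 9, 12, 22, 14, 13]
asr_median = [23, 12, 12, 10, 14, 10, 18, 13, 15, 22, 14, 14]


def deviations(scores: list[float], baseline: list[float]) -> list[float]:
    """Per-file difference from the baseline; negative means better (lower WER)."""
    return [s - b for s, b in zip(scores, baseline)]


def shade(delta: float) -> str:
    """Hypothetical stand-in for the table's cell colors."""
    if delta < 0:
        return "green"
    if delta > 0:
        return "red"
    return "neutral"


devs = dict(zip(files, deviations(google, asr_median)))
for name, delta in devs.items():
    print(f"{name}: {delta:+d} ({shade(delta)})")
print("median of rounded Google scores:", median(google))
```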
I tested Descript as an afterthought. Descript uses Google as the backend ASR service (with some custom post-processing, I’m told) and has a very nice app with a rich feature set. Testing Descript turned out to be helpful in highlighting what appears to be a bug in the Google service.
Let’s explore the odd results for F18.E82. That audio was by far the most challenging in this evaluation. There were four speakers, informal banter and cross-talking, and the audio was slightly clipped. The Human WER score of 11 reflects differences in how the humans rendered the speakers talking over one another and their disfluencies.
Google’s unusually poor result for this file was due to missing chunks of the transcript. When I first tried it there were two large chunks (~50s each) and some smaller chunks missing, and the WER score was 35! I tried rerunning the transcription, and then again with different audio formats, but it didn’t help.
A few days later I tested Descript. It scored an inconsistent mix of good and bad results with a median of 14. That seemed odd for a service that uses Google, especially as it had a better score (28) than Google for F18.E82. I retested Google and it improved to 22 (with 254 more words than in the previous Google transcript). I retested Descript and it improved to 18 (with 183 more words than in the previous Descript transcript). Those results haven’t changed with further testing. Using Google directly for that file still gets a worse result than using Descript, mostly due to Google’s transcript missing a 16 second chunk. Odd.
I regenerated Descript transcripts for the six files that had much worse results than Google’s and they all improved (F10.A41 23.6→20.4; F11.A97 14.6→9.2; F14.B18 11.87→8.14; F14.C42 14.53→11.40; F15.C96 17.60→7.63; F16.D64 15.91→12.67).
This seems like a significant problem with the Google service. I’ve reported it to Descript and had an acknowledgement but haven’t heard back yet.
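Missing chunks like these can be spotted automatically when a service returns per-word timestamps. The sketch below is a rough heuristic under that assumption: the `(word, start, end)` tuple format and the `find_gaps` helper are hypothetical, not any particular service's output format. A long stretch with no words can also be genuine silence, so the result is only a list of places worth listening to.

```python
def find_gaps(words, audio_length, min_gap=10.0):
    """Return (start, end) spans in seconds where no words were transcribed.

    words: list of (word, start_sec, end_sec) tuples sorted by start time
           (an assumed, service-independent format).
    audio_length: total length of the audio in seconds.
    min_gap: shortest silence, in seconds, worth flagging.
    """
    gaps = []
    cursor = 0.0  # end time of the latest word seen so far
    for _, start, end in words:
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Check for a missing chunk at the end of the recording too.
    if audio_length - cursor >= min_gap:
        gaps.append((cursor, audio_length))
    return gaps
```

For example, with words ending at 1.4s and resuming at 52.0s in a 60 second clip, `find_gaps` reports the span `(1.4, 52.0)`.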
I didn’t retest Trint or Sonix because, as noted in my previous post, the transcripts from Trint, Sonix, and Speechmatics differ very little from one another: a differential WER of just 1.4. That suggests those three services are using very similar models and training data.
VoiceBase are now represented by Cielo24, who have taken over the web service.
I had included IBM’s Watson service in this test, hoping it had improved (especially as it now takes MP3 so I didn’t need to transcode as I had before). It was consistently the worst performer, with a median WER of 24, so I dropped it from the results.
I’d also planned to include Remeeting, which I came across after my previous testing and which looked promising. Their results were generally similar to or worse than Cielo24’s, with a couple of transcripts much worse due to extra duplicated fragments of text. They seem to do a good job with speaker identification so I’ll include them in any future testing I do for that.
I was contacted by Unravel shortly before posting this. They, like Descript, use Google to provide the transcripts. Their service is basic and their pricing is low ($15 for 300 mins/month) with a free tier (60 mins/month). While testing the service I encountered the same problem with missing chunks that I described above.
A valid concern with the previous evaluation was that a transcript for the audio I used was available on the internet and so may have been included in the training data for the ASR systems. I doubted that would make much difference in practice, given the quantity of training data needed by ASR systems, but wanted to check.
The last three files (F18.E82, F18.E83, and F18.E84) in this new evaluation were all transcribed before being published on the internet. It’s interesting to note that Scribie was one of the services I used to generate human transcripts and the Scribie Auto ASR service did unusually well on the F18.E82 file. Scribie also did well in my previous testing where I’d also used them to generate the human transcript. (The F15.C96 file in this test is a 10 minute section of that same file and again Scribie Auto ASR did unusually well on that file.)
On the other hand, Scribie Auto ASR did poorly on all the other files even though I’d used Scribie for the human transcripts of them. Similarly, Cielo24 doesn’t appear to have gained a noticeable advantage from having generated human transcripts of the files.
Another data point is that Microsoft performed poorly for those last three files. If those files are removed from the results then Microsoft’s ranking rises above Amazon’s.
The clear winners in this test are Google’s enhanced video ($0.048/min) and Speechmatics ($0.08/min), which came a close second on accuracy and price. (Though clearly there’s an issue with Google missing chunks in the transcript.)
TranscribeMe ($0.25/min) is relatively accurate but is also about three times the price of Speechmatics and lacks features I want. Temi ($0.10/min) is only slightly worse yet less than half the price of TranscribeMe. Otter.ai ($0 up to 600mins/month, 6,000mins for $9.99/mo) is good, though not as good as they appeared to be in my previous test.
Remember, these are just my results with these specific audio files and subject matter. Your mileage will vary. Do your own testing with your own audio to work out which services will work best for you.
Automatic Speech Recognition is amazingly good, yet still far from human levels of accuracy, especially for poor quality audio. Comparing transcripts from multiple services still looks like an appealing way to identify likely errors to aid human editing.
Now that there’s a clear winner (Google) that I have confidence in, the next step is to start generating transcripts for all the podcast episodes. Finally.
Once I’ve a workflow in place for that I can circle back and investigate how to add a workflow for human review and editing. That’s where I’d look more deeply into comparing the ‘master’ transcript from Google with another, e.g. from Speechmatics, to identify and highlight likely errors.
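As a rough illustration of that comparison idea, the sketch below aligns two transcripts word by word with Python's standard `difflib` and reports the spans where they disagree. It is only a starting point under simple assumptions: the `disagreements` helper is hypothetical, and the plain whitespace splitting ignores punctuation and capitalisation differences that a real workflow would need to normalise first.

```python
from difflib import SequenceMatcher


def disagreements(master: str, other: str) -> list[tuple[str, str, str]]:
    """Return (kind, master_words, other_words) for each span where the transcripts differ."""
    a, b = master.split(), other.split()
    # autojunk=False keeps frequent words from being treated as junk in long transcripts.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            result.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return result
```

Each reported span is a place where at least one service is likely wrong, so a human editor could be pointed at just those words rather than the whole transcript.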
I also have ideas for a simple way to compare the quality of speaker identification across services, which will likely prompt another blog post, one day.
There are more of my rambling thoughts in the What Next? section of my previous post.