Back in March 2016 I wrote Semi-automated podcast transcription about my interest in finding ways to make archives of podcast content more accessible. Please read that post for details of my motivations and goals.
Some 11 months later, in February 2017, I wrote Comparing Transcriptions describing how I was exploring measuring transcription accuracy. That turned out to be more tricky, and interesting, than I’d expected. Please read that post for details of the methods I’m using and what the WER (word error rate) score means.
Here, after another over-long gap, I’m returning to post the current results, and start thinking about next steps. One cause of the delay has been that whenever I returned to the topic there had been significant changes in at least one of the results, most recently when Google announced their enhanced models. In the end the delay turned out to be helpful.
The table below shows the results of my tests on many automated speech recognition services, ordered by WER score (lower is better). I’ll note a major caveat up front: I only used a single audio file for these tests, an almost two-hour interview in English between two North American males with no strong accents and good audio quality. I can’t be sure how the results would differ for female voices, more strongly accented voices, lower audio quality, etc.
You can’t beat a human, at least not yet. All the human services scored between 4 and 6. I described them in my previous post, so I won’t dwell on them here.
|Service|WER|Punctuation|Timing|Other Features|Approx Cost|
|---|---|---|---|---|---|
|Google Speech-to-Text (video model, not enhanced)|10.7|792/421/29/1238|Words|C, A, V|$0.048/min|
|Sonix|12.3|943/0/0/900|Lines|D, S, E|$0.083/min+$15/mon|
|Scribie ASR|12.9|970/1307/48/973|None|E|Currently free|
|YouTube Captions|15.0|0/0/0/1075|Lines|S|Currently free|
|AWS Transcribe|22.2|772/0/85/67|Words|S, C, A, V|$0.02/min|
|Vocapia VoxSigma|23.6|771/599/0/931|Words|S, C|$0.02/min approx|
|IBM Watson|25.2|11/0/0/896|Words|C, A, V|$0.02/min|
|Dragon +vocabulary|25.3|9/7/0/967|None||Free + €300 for app|
- WER: Word error rate (lower is better).
- Punctuation: Number of sentences / commas / question marks / capital letters (other than at the start of a sentence).
- Timing: Approximate highest precision timing: Words typically means a data format like JSON or XML with timing information for each word, Lines typically means a subtitle format like SRT, Pgfs (paragraphs) means some lower precision.
- Other Features: E=online editor, S=speaker identification (diarisation), A=suggested alternatives, C=confidence score, V=custom vocabulary (not used in these tests).
- Approx Cost: base cost, before any bulk discount, in USD.
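For readers who skipped the earlier post, here is a minimal sketch of how a WER score of this kind can be computed: the word-level edit distance (substitutions, insertions and deletions) divided by the number of words in the reference, expressed as a percentage. The function name and the simple lowercase-and-split normalisation are assumptions for illustration; they are not the exact normalisation behind the scores in the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate as a percentage of reference words."""
    # Naive normalisation (an assumption): lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

With a six-word reference and one substituted word, this gives a score of about 16.7.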
Note the clustering of WER scores. After the human services scoring from 4–6, the top-tier ASR services all score 10–16, with most around 12. The scores in the next tier are roughly double: 22–28. It seems likely that the top-tier systems are using more modern technology.
For my goals I prioritise these features:
- Accuracy is a priority, naturally, so most systems in the top-tier would do.
- A custom vocabulary would further improve accuracy.
- Cost. Clearly $0.02/min is much more attractive than $0.33/min when there are hundreds of hours of archives to transcribe. (I’m ignoring bulk discounts for now.)
- Word level timing enables accurate linking to audio segments and helps enable comparison/merging of transcripts from multiple sources (such as taking punctuation from one transcript and applying it to another).
- Good punctuation reduces the manual review effort required to polish the automated transcript into something pleasantly readable. Recognition of questions would also help with topic segmentation.
- Speaker identification would also help identify questions and enable multiple ‘timelines’ to help resolve transcripts where there’s cross-talk.
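To put the cost point in concrete terms, here is a small worked example. The archive size of 300 hours is purely an assumption standing in for "hundreds of hours"; the per-minute rates are the ones quoted in this post, before any bulk discount.

```python
def transcription_cost(hours: float, rate_per_min: float) -> float:
    """Base cost in USD for transcribing `hours` of audio at a per-minute rate."""
    return hours * 60 * rate_per_min


ARCHIVE_HOURS = 300  # hypothetical archive size, not a figure from the post

for label, rate in [("$0.02/min", 0.02), ("$0.048/min", 0.048), ("$0.33/min", 0.33)]:
    cost = transcription_cost(ARCHIVE_HOURS, rate)
    print(f"{label:>11}: ${cost:,.2f}")
```

Under that assumption the cheapest rate comes to roughly $360 and the most expensive to roughly $5,940, which is why the per-minute price matters so much for a back catalogue.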
Before Google released their updated Speech-to-Text service in April there wasn’t a clear winner for me. Now there is. Their new `video` premium model is significantly better than anything else I’ve tested.
I’ve not been able to test their enhanced models yet, presumably due to teething troubles. I did try setting `industryNaicsCodeOfAudio` in the recognition metadata of the video model but that made the WER slightly worse. Perhaps it will be of more use with the enhanced models.
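For anyone wanting to try the same experiment, this is a rough sketch of what such a request body might look like. The field names follow my understanding of the Speech-to-Text REST API (beta version at the time of writing) and should be checked against the current documentation. The bucket URI, the helper name and the NAICS code value are all hypothetical, and this is not the exact request used for the tests above.

```python
import json
from typing import Optional


def build_recognize_request(gcs_uri: str, naics_code: Optional[int] = None) -> dict:
    """Build a JSON-serialisable request body for a long-running recognition call."""
    config = {
        "languageCode": "en-US",
        "model": "video",                    # the model discussed in this post
        "enableWordTimeOffsets": True,       # word-level timing
        "enableAutomaticPunctuation": True,  # punctuation in the transcript
    }
    if naics_code is not None:
        # Optional recognition metadata; the value passed in is up to the caller.
        config["metadata"] = {"industryNaicsCodeOfAudio": naics_code}
    return {"config": config, "audio": {"uri": gcs_uri}}


# Hypothetical bucket path and illustrative NAICS code.
request = build_recognize_request("gs://example-bucket/episode.flac", naics_code=519130)
print(json.dumps(request, indent=2))
```

Keeping the metadata optional makes it easy to run the same audio with and without it and compare the resulting WER.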
Punctuation is clearly subjective but both Temi and Scribie get much closer than Google to the number of question marks and commas used by the human transcribers. Google did very well on capital letters though (a rough proxy for proper nouns).
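The punctuation figures in the table are simple counts, and something like the following sketch would produce a comparable profile. It is an approximation under stated assumptions: sentences are split naively after `.`, `!` or `?`, and every mid-sentence word starting with a capital letter is counted (including the pronoun "I"), which may not match the exact rules used for the table.

```python
import re


def punctuation_profile(text: str) -> str:
    """Return 'sentences/commas/question marks/mid-sentence capitals'."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    commas = text.count(",")
    question_marks = text.count("?")

    capitals = 0
    for sentence in sentences:
        # Skip the first word, which is capitalised simply for starting the sentence.
        for word in sentence.split()[1:]:
            if word[0].isupper():
                capitals += 1

    return f"{len(sentences)}/{commas}/{question_marks}/{capitals}"
```

For example, `punctuation_profile("Hello there, Tim. Did you see Google, or not? I did.")` returns `"3/2/1/2"`.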
I think we’ll see a growing ecosystem of tools and services using the Google Speech-to-Text service as a backend. The Descript app is an interesting example.
While working on Comparing Transcriptions I’d realised that comparing transcripts from multiple services is a good way to find errors.
So for this post I also compared most of the top-tier services against one another, i.e. using the transcript from one as the ‘ground truth’ for scoring others. A higher WER score in this test is good. It means the services are making different mistakes and those differences would highlight errors.
Google, Temi, Voicebase, Scribie, and TranscribeMe all scored a high WER, over 10, against all the others. Go-Transcribe had a WER of 3.6 against Trint and 6.1 against Speechmatics. Trint, Sonix, and Speechmatics have very little difference between the transcripts, a WER of just 1.4. That suggests those three services are using very similar models and training data.
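As a rough illustration of how two automated transcripts can be used to flag likely errors, the sketch below aligns them word by word with Python's standard `difflib` module and reports the spans where they disagree. This is a simplified stand-in, not the tooling used for the scores above; the function name and the lowercase-and-split normalisation are assumptions.

```python
from difflib import SequenceMatcher


def disagreements(transcript_a: str, transcript_b: str) -> list:
    """Return (word index in A, words from A, words from B) for each differing span."""
    a_words = transcript_a.lower().split()
    b_words = transcript_b.lower().split()
    # autojunk=False stops difflib treating frequent words as junk in long inputs.
    matcher = SequenceMatcher(a=a_words, b=b_words, autojunk=False)

    spans = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            spans.append(
                (a_start, " ".join(a_words[a_start:a_end]), " ".join(b_words[b_start:b_end]))
            )
    return spans
```

Each reported span is a place where at least one service is probably wrong, so those are the passages worth highlighting for a human reviewer. Two services with near-identical output, like the 1.4 WER trio above, would yield very few spans and so add little.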
My primary goal is to get the transcripts available and searchable, so the next phase would be developing a simple process to transcribe each podcast and convert the result into web pages. That much seems straightforward using the Google Speech-to-Text API. Then there’s working with the podcast host to integrate with their website, style, menus etc.
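A minimal sketch of the "convert the result into web pages" step might look like this. It assumes the recognition output has already been reduced to a list of `(word, start_seconds)` pairs; that shape, the function name and the `data-` attribute names are all hypothetical choices for illustration.

```python
from html import escape


def words_to_html(words, audio_id="episode-audio"):
    """Render timed words as an HTML paragraph with one span per word."""
    spans = [
        # data-start records where the word begins, for linking text to audio.
        f'<span data-start="{start:.2f}">{escape(word)}</span>'
        for word, start in words
    ]
    return f'<p data-audio="{escape(audio_id)}">' + " ".join(spans) + "</p>"


print(words_to_html([("Hello", 0.0), ("and", 0.41), ("welcome", 0.52)]))
```

Keeping the start time on every word means a small script on the page could later seek the audio player to whatever word the reader clicks, without changing the generated markup.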
After that the steps are more fuzzy. I’ll be crossing the river by feeling the stones…
The automated transcripts will naturally have errors that people notice (and more that they won’t). To improve the quality it’s important to make it very easy for them to contribute corrections. Being able to listen to the corresponding section of audio would be a great help. All that will require a web-based user interface backed by a service and a suitable data model.
The suggested corrections will need reviewing and merging. That will require its own low-friction workflow. I have a vague notion of using GitHub for this.
Generating transcripts from at least one other service would provide a way to highlight possible errors, in both words and punctuation. That would be useful for readers and also encourage the contribution of corrections. The ‘other services’ could include transcriptions contributed by volunteers. This kind of multi-transcription functionality has significant implications for the data model.
I’d like to directly support translations of the transcriptions. The original transcription is a moving target as corrections are submitted over time, so the translations would need to track corrections applied to the original transcription since the translation was created. Translators are also very likely to notice errors in the original, especially if they’re working from the audio.
Before getting into any design or development work I’d want to do another round of due diligence research, looking for what services and open source projects might be useful components or form good foundations. Amara springs to mind. If you know of any existing projects or services that may be relevant please add a comment or let me know in some other way.
I’m not sure when, or even if, I’ll have any further updates on this hobby project. If you’re interested in helping out feel free to email me.
I hope you’ve found my rambling explorations interesting.
- 25th May 2018: Updated SimonSays.ai with much improved score