After a pause I am working again on my semi-automated podcast transcription project. The first part involves evaluating the quality of various methods of transcription. But how?
In this post I’ll explore how I’ve been comparing transcripts to evaluate transcription services. I’ll include the results for some human-powered services. I’ll write up the results for automated services in a later post.
The key metric for transcription is accuracy: how closely the words in the generated transcript match the words spoken in the original audio.
To compare the words in the transcript with the audio you need a reference transcript that’s deemed to be completely accurate; a ground truth upon which comparisons can be based. Then the sequence of words in that reference transcript can be compared against the sequence of words in the hypothesis transcript from each system being compared.
Word Error Rate
The Word Error Rate (WER) is a very simple and widely used metric for transcription accuracy. It’s a number, calculated as the number of words that need to be changed, inserted, or deleted to convert the hypothesis transcript into the reference transcript, divided by the number of words in the reference transcript. (It’s the Levenshtein distance for words, measuring the minimum number of single-word edits to correct the transcript.) A perfect match has a WER of zero; larger values indicate lower accuracy and thus more editing.
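That definition is straightforward to compute with the standard dynamic-programming form of the Levenshtein distance, applied to words rather than characters. This is just an illustrative sketch, not the code I actually used:

```python
def word_error_rate(reference, hypothesis):
    """WER = minimum word edits (substitutions, insertions, deletions)
    needed to turn hypothesis into reference, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard Levenshtein dynamic-programming table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, comparing “the cat sat on the mat” with “the cat sit on mat” gives two edits (one substitution, one deletion) over six reference words, i.e. a WER of one third.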
One Word Per Line
I decided early on that I’d simply convert a transcript text file into a file containing one word per line, and then use a simple diff command to identify the words that need to be changed/inserted/deleted, and the diffstat command to count them up.
Simple, in theory. In practice a significant amount of ‘normalization’ work, which I’ll describe below, was needed to reduce spurious differences.
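The basic pipeline looks something like this (the file names are placeholders, and in practice the normalization step runs before the diff):

```shell
# Two toy transcripts standing in for the reference and hypothesis files.
printf 'the quick brown fox jumps\n' > ref.txt
printf 'the quick browny fox\n' > hyp.txt

# Convert each transcript to one word per line.
tr -s '[:space:]' '\n' < ref.txt > ref.words
tr -s '[:space:]' '\n' < hyp.txt > hyp.words

# Count the words diff flags as changed, inserted, or deleted.
# (diffstat gives a nicer per-file summary of the same information.)
diff ref.words hyp.words | grep -c '^[<>]'
```

One caveat with counting raw diff lines: a substituted word appears as both a `<` line and a `>` line, so it counts as one deletion plus one insertion rather than a single edit. That’s close enough for comparing services against the same reference.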
Visualizing the Differences
A very useful command for inspecting and comparing these ‘word files’ is vimdiff. It gives a clear colour-coded view of the differences between up to four files.
Here’s an example comparing word files that haven’t been normalized. The left-most column is the transcript produced by a human volunteer. The other three columns, from left to right, were generated by automated systems (in this case VoiceBase API, Dragon by Nuance, and Watson by IBM).
This very small section of the transcript has several interesting differences.
Before I talk about normalization I want to draw your attention to the second column, the automated transcript by VoiceBase. Note the “two thousand and two” vs “two thousand two” in the fourth column, and “just down the road” vs “just on the road” in the other three columns. The phrases “just down the road” and “two thousand and two” would be more common than the alternatives, yet the alternatives are correct in this case.
I suspect this is an example of automated services giving too much weight to their training data when selecting the best hypothesis for a sentence. (It’s a similar situation to autocorrect replacing your misspellings with correctly spelt but inappropriate words.) A key point here is that, because the chosen hypothesis is likely to read well, it’s harder for a human to notice this kind of error.
You can see from the vimdiff output above that numbers can cause differences to be flagged even though the words convey the correct meaning: “2,000” vs “2000” vs “two thousand”, and “2002” vs “two thousand and two” vs “two thousand two”. To address this I wrote some code to ‘normalize’ the words. For numbers it converts them to word form and handles years (“1960” to “nineteen sixty”) and pluralized years (“1960s” to “nineteen sixties”) as special cases.
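The year handling can be sketched like this. This is a simplified illustration rather than my actual normalization code, and it only covers four-digit years:

```python
import re

UNITS = ("zero one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digit(n):
    """Spell out a number from 0 to 99."""
    if n < 20:
        return UNITS[n]
    tens, units = divmod(n, 10)
    return TENS[tens] + (" " + UNITS[units] if units else "")

def year_words(year):
    hi, lo = divmod(year, 100)
    if 1100 <= year <= 1999:
        # Years read as two pairs: 1960 -> "nineteen sixty".
        return two_digit(hi) + (" hundred" if lo == 0 else " " + two_digit(lo))
    if 2000 <= year <= 2099:
        # 2002 -> "two thousand two" (the form that matched the audio here).
        return "two thousand" + ("" if lo == 0 else " " + two_digit(lo))
    return two_digit(hi) + " " + two_digit(lo)

def normalize_year_token(token):
    m = re.fullmatch(r"(\d{4})(s?)", token)
    if not m:
        return token
    words = year_words(int(m.group(1)))
    if m.group(2):  # pluralized year: 1960s -> "nineteen sixties"
        words = words[:-1] + "ies" if words.endswith("y") else words + "s"
    return words
```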
Words containing apostrophes are another source of differences: I’m splitting on any non-word character so “they’re” is being split into two words. Some transcription systems might produce “they are” (not shown in this example). Apostrophes are tricky. To address this the normalization code expands common contractions, like “they’re” into “they are”, handles some non-possessive uses of “‘s”, and removes the apostrophe in all other cases. It’s a fudge of course but seems to work well enough.
Another significant area for normalization is compound words. Some people, and systems, might write “audio book” while others write “audiobook” or “audio-book”. To address this the normalization code expands closed compound words, like “audiobook” into the separate words “audio book”. It seemed to be too much work to generalize this so the normalization code has a hard-coded list that detects around 70 compound words that I encountered while testing. (Remember that the goal here isn’t perfection, it’s simply reducing the number of insignificant differences for my test cases so the WER scores are more meaningful.)
Other normalizations the code handles include ordinals (“20th” becomes “twentieth”), informal terms (“gotta” becomes “got to”), spellings (“realise” becomes “realize”), and some abbreviations are collapsed (“L.A.” becomes “LA”).
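Pulled together, the word-level normalizations amount to a series of table lookups plus a couple of fallback rules. The tables here contain just a few illustrative entries, not my full lists:

```python
import re

# Illustrative entries only -- the real lists are much longer.
CONTRACTIONS = {"they're": "they are", "i'm": "i am", "don't": "do not"}
INFORMAL = {"gotta": "got to", "gonna": "going to"}
ORDINALS = {"20th": "twentieth", "1st": "first"}
SPELLINGS = {"realise": "realize", "colour": "color"}
COMPOUNDS = {"audiobook": "audio book", "website": "web site"}

def normalize_word(word):
    w = word.lower().replace("\u2019", "'")  # fold curly apostrophes
    # Collapse dotted abbreviations: "l.a." -> "la".
    if re.fullmatch(r"(\w\.)+", w):
        return w.replace(".", "")
    for table in (CONTRACTIONS, INFORMAL, ORDINALS, SPELLINGS, COMPOUNDS):
        if w in table:
            return table[w]
    # Fallback: strip any remaining apostrophes ("speaker's" -> "speakers").
    # The real code also handles some non-possessive uses of "'s".
    return w.replace("'", "")
```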
Transcripts Produced by Humans
For my evaluation I chose a podcast episode just under two hours long that had good quality audio and already had a transcript produced by a volunteer. The primary voice was an American male who spoke clearly but quickly and without a strong accent.
I suspected that a single human-produced transcript wouldn’t be sufficiently reliable so I ordered human-produced transcripts from three separate transcription services. This turned out to be more interesting, and more useful, than I’d anticipated.
- 3PlayMedia “Extended (10 Days)”
- Rate $3/min = $333.44 total.
- Completed in 7 days.
- Seven “flags” in the transcript, mostly “inaudible” or “interposing voices”.
- Scribie “Flex 5 (3-5 Days)”
- Rate $0.75/min = $83.35 total.
- Completed in 3 days. Formats: TXT, DOC, ODF, PDF. (No timecodes)
- VoiceBase “Premium 99%, 5-7 Days, 3 human reviews”
- Rate $1.5/min = $168.00 total.
- Completed in 4 hours. Formats: PDF, RTF, SRT (timecodes).
- Note that VoiceBase provide both human and automated transcription services.
The 4:1 difference in cost is notable! 3PlayMedia certainly charge a premium. Let’s see how their transcripts compare.
With the normalization I was expecting these transcripts, all produced by humans, to be closer to each other than they turned out to be. Here’s an example of some differences:
(In this vimdiff image, and all the ones that follow, the columns from left-to-right are: Volunteer, 3PlayMedia, Scribie, VoiceBase.)
That shows three different transcriptions of the “to use” phrase. It’s not very clear on the audio and I’d agree with the majority here and say that “you used” was correct. In this example it isn’t significant but it does illustrate that imperfect human judgement is needed when the audio isn’t completely clear. Transcribers have to write something and can’t easily express their degree of confidence. If it falls too low they might write something like “[inaudible]” or “[crosstalk]”, but above that threshold they take a guess and you don’t know that. And because the guess is likely to read well it’s harder for a human to notice this kind of error.
The second difference shown in that image relates to the difference between a clean transcript and a verbatim transcript. In a clean transcript confirmational affirmations (“Uh-huh.”, “I see.”), filler words (“ah”, “um”) and other forms of speech disfluency aren’t included.
That “you are” is in the audio, so would be in a verbatim transcript, but three of the four humans decided that it wasn’t significant and should be left out of the clean transcript. But one of the four decided it should be kept in. Other common examples include “I mean” and “you know”. There’s no right answer here, it’s a judgement call case-by-case.
It works the other way as well. Sometimes the transcriber will add a word or two that they think makes the text more clear. Compare “everything submits to and is accountable to” with “everything submits to it and is accountable to”. Two of the four humans decided to add an “it” that wasn’t in the audio. Similarly “believe it” vs “believe in it”: here again two of the four added an “in”, only this time it was not the same two transcribers adding the word. Transcribers are likely to “clean” transcripts in a way that’s biased towards their own speaking style. Speakers have a distinct verbal style, and changes like these by a transcriber can be more distracting than helpful if not in keeping with the speaker’s own style.
Generally these interpretations of the audio, and writing the corresponding text, are made with care and don’t alter the meaning for the reader. At least that’s what I was thinking until I came across this: To be fair this part of the audio is a little garbled due to crosstalk. I’m sure the speaker said “doesn’t matter” (which got normalized to “does not matter”). It’s another example of where the lack of confidence indicators in human transcripts is a problem.
Here are a few other examples of differences in these human transcripts that caught my eye:
Getting to Ground Truth
To compare transcription services I needed a reference transcript – a ground truth – against which to compare the others.
Since the transcripts varied significantly I had little choice but to create my own ‘ground truth’ transcript manually. I copied the transcript generated by the volunteer, then listened to the audio in the places where the various transcripts differed in a non-obvious way – well over 200 of them. For each place I decided what the most accurate transcription was and edited the ground truth transcript to match. The most difficult places were where multiple speakers talk over one another. It’s very hard to convey the intent clearly and accurately in a linear sequence of words. Often ‘clearly’ and ‘accurately’ are at odds with each other.
I repeated this process with the automated transcripts in order to add in the disfluencies and confirmational affirmations etc. present in the audio. In other words, to shift the ground truth transcript from a ‘clean’ transcript to being closer to a ‘verbatim’ transcript. Without that work the apparent Word Error Rate of the automated transcripts would be unfairly higher. They would all be equally affected but that effect would reduce the visibility of the genuine differences. (With hindsight I should have used a separate file for this step but the overall process was iterative and exploratory rather than the linear sequence outlined here.)
The final ground truth transcript has 21,629 words.
The transcripts produced by the volunteer and by Scribie identified the speakers. The transcript from VoiceBase identified transitions from one speaker to another, but didn’t identify the specific speaker. The transcript from 3PlayMedia didn’t identify speakers or transitions, despite costing three to four times as much.
3PlayMedia flagged seven places in the transcript with [? … ?] where the transcriber was unsure of the words but had made a reasonable guess, plus three instances of [INTERPOSING VOICES] and nine [INAUDIBLE]. VoiceBase flagged five [CROSSTALK] and two [INAUDIBLE]. Scribie flagged none.
Most of the text 3PlayMedia flagged as unsure turned out to be correct. For about half of the [INTERPOSING VOICES] and [INAUDIBLE] flags in the 3PlayMedia transcript, the other services had accurate transcriptions.
Segmentation / Punctuation
The volunteer transcript had 842 sentences in 158 paragraphs.
The 3PlayMedia transcript had 1259 sentences in 331 paragraphs.
The Scribie transcript had 915 sentences in 223 paragraphs.
The VoiceBase transcript had 1077 sentences in 648 paragraphs.
I’m not sure what to make of those wide differences. The figures are a little noisy due to artifacts in the way the files were processed, but most of the differences seem to be due simply to style.
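Counts like those can be produced with a crude segmentation counter along these lines. The splitting rules here (blank-line paragraphs, sentences ended by “.”, “?”, or “!”) are an assumption, and real transcripts need more care with abbreviations and initials:

```python
import re

def segmentation_stats(text):
    """Return (sentence_count, paragraph_count) for a transcript.

    Paragraphs are separated by blank lines; a sentence is any run of
    text terminated by '.', '?', or '!'. Deliberately crude.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    sentences = re.findall(r"[^.?!]+[.?!]", text)
    return len(sentences), len(paragraphs)
```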
The figures for the automated systems (in a future post) highlight those which do a very poor job of segmentation. The big downside of Dragon by Nuance, for example, is that it doesn’t do segmentation. You simply get a very long stream of words. So you’ll still have a lot of work to do to make them usable, no matter how accurate they might be.
Results for Human Transcription
Service     WER  Diarisation    Timing     Cost
==========  ===  =============  =========  =======
3PlayMedia  4.5  None           Subtitles  $333.44
VoiceBase   4.6  Transitions    Subtitles  $168.00
Scribie     5.1  Speaker names  Paragraph  $ 83.35
Volunteer   5.3  Speaker names  None       N/A
The important number here is the Word Error Rate (WER). Lower is better. The difference between 4.5 and 5.3 is quite small in practice. Most of the ‘errors’ are due to insignificant differences, or to ambiguous sections caused by cross-talk.
I suspect a WER around 5 represents a reasonable ‘best case’ for transcription of an interview. For comparison, the best of the automated transcription services I’m testing have WER of 12 to 16, with some in the 30 to 40 range.
All this work was to understand how to judge the accuracy of a transcription in order to evaluate automated systems. Comparing human transcription services turned out to be a useful approach to understanding the issues and helping to develop the tools.
It’s clear that for the highest accuracy it’s very helpful to use more than one service and check the places where they differ. Of course that significantly increases the cost and effort.
I’m testing a number of automated systems currently and I’ll include those results in a later blog post.