The medium of podcasting continues to grow in popularity. Americans, for example, now listen to over 21 million hours of podcasts per day. Few of those podcasts have transcripts available, so the content isn’t discoverable, searchable, linkable, reusable. It’s lost.
The typical solution is to pay a commercial transcription service. These charge roughly $1/minute and claim around 98% accuracy. For a podcast producing an hour of content a week, that would add an overhead of around $250 a month. A back catalogue of a year of podcasts would cost over $3,100 to transcribe.
When I remember fragments of some story or idea that I recall hearing on a podcast, I’d like to be able to find it again. Without searchable transcripts I can’t. It’s impractical to listen to hundreds of old episodes, so the content is effectively lost.
Given the advances in automated speech recognition in recent years, I began to wonder if some kind of automated transcription system would be practical. This led on to some thinking about interesting user interfaces.
This (long) post is a record of my research and ponderings around this topic. I sketch out some goals, constraints, and a rough outline of what I’m thinking of, along with links to many tools, projects, and references to information that might help. I’ve also been updating it as I’ve come across extra information and new services.
I’m hoping someone will tell me that such a system, or parts of it, already exist so that I can contribute to those existing projects. If not then I’m interested in starting a new project – or projects – and would welcome any help. Read on if you’re interested…
Here is an outline of functionality that I’d like from a basic automated system:
- Produce podcast transcripts as plain text on static web pages that are indexed by search engines.
- Provide anchors to make it easy for people to link to a particular section, or sections, in the transcript.
- Provide buttons to play the audio/video from that point. This requires the transcription to have timecode data.
- Identify and show who is speaking, e.g. via speaker diarisation.
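To make the anchor and play-button ideas above a little more concrete, here's a minimal sketch of rendering one timecoded segment as static HTML. The segment layout (a dict with 'start', 'speaker' and 'text') and the function names are my own assumptions for illustration. The `#t=` suffix on the audio URL is the W3C Media Fragments syntax for a start time.

```python
from html import escape


def format_timecode(seconds):
    """Format a number of seconds as H:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def render_segment(segment, audio_url):
    """Render one transcript segment as an HTML paragraph.

    `segment` is a hypothetical dict with 'start' (seconds), 'speaker'
    and 'text' keys; it is not the output format of any real service.
    """
    start = int(segment["start"])
    anchor = f"t{start}"  # id that other pages can link to
    play_link = f"{escape(audio_url)}#t={start}"  # Media Fragments start time
    return (
        f'<p id="{anchor}">'
        f'<a href="#{anchor}">[{format_timecode(start)}]</a> '
        f'<a href="{play_link}">play</a> '
        f'<strong>{escape(segment["speaker"])}:</strong> '
        f'{escape(segment["text"])}</p>'
    )
```

Each paragraph gets an id derived from its start time, so a link to a particular moment stays valid even if the wording is later corrected.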
Of course, an automated transcription is likely to have errors. Perhaps many. For a popular podcast there are likely to be some members of the audience (perhaps many) who are willing to contribute some amount of time to checking and correcting errors, somewhat like Wikipedia. A low-friction user experience makes that more likely.
In other words, crowdsourcing of error checking and correction may be a viable way to close the “quality gap” between manual and automated transcriptions. At this point I’ve no idea how big that gap will be, though I’m confident it can be made small enough for this whole endeavour to be worthwhile. (I’m assuming that the podcasts will have clear high-quality audio.)
I have explored the options for transcription in more detail below.
Beyond the basic transcription, presentation, and editing features there are many interesting possibilities for future enhancements.
Natural language processing
- Using keyword extraction to automatically identify suitable keywords for indexing, to aid search and discovery. Also entity extraction to identify the names of things, such as people, companies, or locations.
- Identification of topic segments within a podcast is much more difficult, but also more useful. This is an interesting area of research, e.g. Maui (software). I’d like to support overlapping segments to cover both high-level themes and the specifics within them.
- The keyword extraction could then be applied to individual segments, as well as whole podcasts, to aid finer-grained indexing.
- Some kind of classification of topics into, or with, a taxonomy might also be helpful for someone exploring a large topic space.
- Generate automatic summaries of segments. The summaries for all the segments would form a summary of the episode.
Those would open up alternative ways to search and explore a collection of podcasts. You’d be able to easily read or listen to all the segments that touch on a given topic across many episodes. Perhaps stitching them into a thread or ‘playlist’ you can share with others, somewhat like Storify.com.
There are also more immediate, practical problems such as recognising the boundary between sentences and fixing the casing of words. These aren’t critical but would significantly reduce the error checking and correction required to create a high quality transcript.
It should be clear by now that the underlying transcript data will need to be stored in some kind of database where it can be augmented with timecodes, speakers, segment details, keywords etc.
The database would also support user interfaces for error checking and correction, fine-tuning segments, and keywords etc.
From there the transcripts could be output in a variety of forms, from static web pages to rich interactive tools for exploration and sharing.
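As a rough illustration of the kind of storage I have in mind, here's a minimal sketch using SQLite from Python's standard library. The table and column names are assumptions made up for this example, not a finished design. A real system would probably need per-word timings and an edit history as well.

```python
import sqlite3

# Hypothetical schema: one row per episode, one row per timecoded segment,
# and any number of keywords attached to each segment.
SCHEMA = """
CREATE TABLE episode (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    audio_url TEXT NOT NULL
);
CREATE TABLE segment (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episode(id),
    start_seconds REAL NOT NULL,
    end_seconds REAL NOT NULL,
    speaker TEXT,
    text TEXT NOT NULL,
    reviewed INTEGER NOT NULL DEFAULT 0  -- set to 1 once a human has checked it
);
CREATE TABLE keyword (
    segment_id INTEGER NOT NULL REFERENCES segment(id),
    keyword TEXT NOT NULL
);
"""


def create_database(path=":memory:"):
    """Create the tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def segments_in_order(conn, episode_id):
    """Return (start_seconds, speaker, text) rows for an episode, by time."""
    rows = conn.execute(
        "SELECT start_seconds, speaker, text FROM segment "
        "WHERE episode_id = ? ORDER BY start_seconds",
        (episode_id,),
    )
    return rows.fetchall()
```

Keeping the start time on every segment means the different outputs (static pages, subtitles, search indexes) can all be generated from the same rows.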
Full Text Search
Web search engines like Google and Bing are very good at what they do. Yet they are very general tools, trying to do the best they can for all the web pages on the internet. There are better tools for specific jobs.
One that I’m familiar with is Elasticsearch which has a rich set of features for dealing with human language and powerful full-text search capabilities. Beyond its general capabilities, it can be taught synonyms specific to the topics covered by the podcast. This would significantly improve the quality of search results.
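Here's a minimal sketch of what podcast-specific synonyms might look like, written as a Python function that builds the body of an Elasticsearch index-creation request. The "synonym" token filter is a standard Elasticsearch feature, but the filter name, analyzer name, field name and example synonyms are all assumptions for illustration. The exact layout of the "mappings" section differs between Elasticsearch versions, so treat that part as approximate.

```python
import json


def synonym_index_settings(synonyms):
    """Build index settings with a custom analyzer that applies synonyms.

    `synonyms` is a list of rules such as "asr, speech recognition",
    where comma-separated terms are treated as equivalent.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name
                    "podcast_synonyms": {
                        "type": "synonym",
                        "synonyms": list(synonyms),
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer name
                    "transcript_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "podcast_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "text": {"type": "text", "analyzer": "transcript_text"}
            }
        },
    }


if __name__ == "__main__":
    body = synonym_index_settings(["asr, speech recognition"])
    print(json.dumps(body, indent=2))
```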
I generally listen to podcasts as audio, while driving or resting, even those that are available as videos. I hadn’t given any thought to subtitles as another output format until I started researching what transcription tools, projects and services already existed. I’ll talk more about it below.
Here’s a simple schematic, for what it’s worth:
What’s Out There
Applications that Facilitate Manual Transcription
These tools typically provide a user-interface that combines a media player with a text editor. You play the media and start typing what you hear (as fast as you can), pause, rewind a bit, repeat.
Here is a selection for reference, in no particular order:
- InqScribe for Mac and Windows. $39-99.
- HyperTRANSCRIBE, Mac and Windows. $40.
- Transana, Mac and Windows. $75.
- Transcriber Pro, Windows only. €10/year
- GearPlayer, Windows only. $120.
- pmTrans, open source for Linux, Mac, and Windows. Free.
- Express Scribe, for Mac and Windows. Free.
- Transcribe, web service, $20/year.
- Scribie transcription editor, web service. Free.
- oTranscribe, web app, open source.
- NowTranscribe combines automatic generation of a draft with predictive correction and automatic control of the audio playback. It’s an innovative approach that’s worth seeing in action.
If you’re performing manual transcription at the moment, especially with a standard word processor, I’d urge you to try some of these. They may smooth out the process in many small ways that accumulate to save you a lot of time and effort.
When performing manual transcription it obviously helps to be able to type fast, ideally fast enough to keep up with the speakers. Approximate words per minute rates are around 150–200 for typical podcast speakers, and 40–80 for average-to-good typists. That difference creates a problem.
Very few transcribers can keep up with typical speakers. The usual solution is to use a foot pedal to rewind the media by a few seconds whenever needed; that way your fingers can stay on the home row of the keyboard. Yet every time you rewind there’s a break in your flow and productivity falls.
An alternative approach is to slow the media playing down to match a comfortable typing rate. This can be done with audio time-scale/pitch modification techniques such as PSOLA which can change the speed without altering the pitch. Most of the tools I’ve listed above support variable speed playback, but only a few explicitly mention maintaining the correct pitch. The free web-based Scribie transcription editor seems particularly good at this.
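To put rough numbers on the gap between speaking and typing rates, here's a small sketch of the arithmetic. It assumes a constant speaking rate and ignores pauses and corrections, so it gives a best case. The function names are made up for this example.

```python
def playback_rate(typing_wpm, speech_wpm):
    """Fraction of normal speed at which speech arrives at the typing rate."""
    if typing_wpm <= 0 or speech_wpm <= 0:
        raise ValueError("rates must be positive")
    # Never speed the audio up beyond normal in this simple model.
    return min(1.0, typing_wpm / speech_wpm)


def transcription_minutes(audio_minutes, typing_wpm, speech_wpm):
    """Minutes of typing needed for the audio at the slowed playback rate."""
    return audio_minutes / playback_rate(typing_wpm, speech_wpm)
```

For example, a 60 words-per-minute typist working on a 150 words-per-minute speaker would need playback at 0.4 of normal speed, so an hour of audio takes about two and a half hours to type.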
Commercial Transcription Services
These provide a service where you upload an audio or video file and get back a file containing the transcription. You’re paying some amount of money for someone to use an application (like those above) on your behalf, plus some level of quality checking. I’ll only list here a few services that provide timecoded transcriptions, including subtitling services.
At the very high-end, 3play Media are a traditional transcription service provider offering “Premium quality with +99% accuracy” for prices ranging from $2 to $3 per minute. They provide an API for upload/download.
At the very low-end, if you’re willing to handle the management of the work then Fiverr have a number of people offering transcription services for $5 (typically for 10 to 20 minutes of transcription). Your mileage will vary.
In the innovative-middle-ground, Scribie guarantee +98% accuracy, offer prices down to $0.70/min for 20-30 day turnaround, and include time-coding. There’s an additional charge of $1.00/minute for producing subtitles (SBV/SRT). They provide an API and have an interesting blog. They also make their own transcription editor web application freely available for anyone to use. I like their technology and ‘managed crowdsourcing‘ approach.
Commercial Transcription Services (behind the scenes)
Speaking of crowdsourcing, while researching this post I came across CrowdSurfWork. This site is an interface for freelance transcribers to work on “micro-tasks” related to transcription. Their system is built on Amazon.com’s Mechanical Turk service, which provides a marketplace for “Human Intelligence Tasks”. Typical micro-tasks include transcribing a chunk of audio (“up to 35 seconds”), reviewing and scoring a chunk of transcript, quality checking a whole transcript etc. CrowdSurfWork don’t say who their clients are. They’re certainly not the only ones using Mechanical Turk for transcription work.
Commercial services provide a complete transcription service: audio in, high quality transcript out. Internally that work is usually broken down into a transcription phase and a quality check/edit phase. I wonder if some companies could offer a service that takes a raw initial transcript (e.g. generated by an automated transcription system) and just performs the quality check/edit phase, at a lower cost.
It seems very likely that some companies are already using automated transcription systems, especially for regular clients where the system could be trained for the client’s voice.
Free Automated Transcription
Automatic speech recognition has come a long way in recent years, with untrained speaker-independent systems achieving useful levels of accuracy.
Google Docs now supports Voice typing which you can use to transcribe your voice, or other audio being played at the time. It only works in the Chrome browser, or the Docs app on iOS or Android. Here’s a demo. (See also Speechlogger which uses the same underlying Google technology and has some handy tips on improving the quality when transcribing audio files by using a “virtual line-in cable”. See also Loopback for Mac.)
Another relevant way to access Google’s speaker-independent speech recognition is to upload a video to YouTube and let it provide Automatic Captioning for you. More on that below.
On a Mac you can use your voice to enter text into almost any application. The default mode uses a web service but you can enable Enhanced Dictation which installs the recognition code locally so you don’t need an internet connection and can “dictate continuously”.
These don’t offer any customisation or training to improve the accuracy.
Microsoft Windows offers a similar Speech Recognition service. It supports a customisable speech dictionary and accuracy improves with usage. As far as I can tell this is built in to the operating system and doesn’t use a network service.
There are a number of speech recognition projects for Linux. I have not looked into them in detail. If you have experience with any that would fit this project I’d be grateful if you would get in touch with me.
Commercial Speech-to-text Services
The Google Cloud Speech API offers access to APIs for applications to “see, hear and translate”. It’s based on the same neural network tech that powers Google’s voice search in the Google app and voice typing in Google’s Keyboard and Chrome described above. It offers some customization in the form of a list of phrases (up to 500, provided with the API request) that act as “hints” to the speech recognizer to favor specific words and phrases in the results. The current limits cap audio length at 80 minutes and require use of uncompressed audio.
Nuance, who currently provide the technology behind Apple’s Siri and dictation services, offer an HTTP REST Cloud speech recognition service that’s targeted at mobile devices. (I presume this is the service behind their new, and expensive, Dragon Anywhere mobile dictation app.)
The service supports uploading custom phrases and vocabularies. It also allows you to specify an ID for the speaker which is used for Speaker-Dependent Acoustic Model Adaptation (SD-AMA). This “creates adapted acoustic model profiles from audio collected from each user to improve recognition performance over time.” Both of these should help improve accuracy beyond what’s possible with speaker-independent services like those from Google or Apple.
The pricing is $0.008 per transaction, where a ‘transaction’ is a successful HTTP request, presumably about a sentence (I’ve seen references to 30 seconds as a maximum). Their terms require ‘Emerald Level’ payment when the client isn’t a mobile device. Some negotiation might be required!
Microsoft provide a Bing Speech API. The REST API only supports 10 seconds of audio per request, similar to Nuance ‘transactions’ described above. Their Client Library supports streaming.
IBM offers their Watson Developer Cloud Speech to Text service. It has both HTTP REST and WebSocket APIs. The pricing is free for the first thousand minutes per month, then $0.02 per minute. The IBM service doesn’t support SD-AMA. Support for custom vocabularies was added in September 2016. (They’ve said they’re working on speaker diarization.) The results include timestamps, confidence indicators and alternative suggestions. Here’s an example use to translate a ProPublica podcast.
Vocapia provide a Speech to Text API service called VoxSigma. It returns “XML with speaker diarization, language identification tags, word transcription, punctuation, confidence measures, numeral entities and other specific entities”. They also support customization in the form of ‘Language Model Adaptation’ by uploading sample text. I’ve requested technical documentation and pricing details, neither of which are on their web site. They’ve given me a trial account to test the service.
Speechmatics provide speech to text services with a simple REST API. The transcript data includes speaker diarization, word transcription, punctuation, and confidence measures. They don’t offer any customization. Pricing is £0.06/minute (£3.60/hour), with the first hour free. Speechmatics claim to be the world’s most accurate transcription service.
Voicebase provide a transcription service. They’re using a different version of Speechmatics technology (with slightly lower accuracy it seems). I’m including them here because they provide interesting keyword extraction features. From a two hour interview I uploaded they extracted 94 keywords (like “ecological limits”, “symbolic language” etc.) and grouped them under 170 headings (like “Bioethics”, “Ontology” etc.). Clicking on a keyword or group, or entering search terms manually, shows all the places in the audio timeline where the topic is spoken about. You can then easily listen to just those parts. As you do the relevant portion of the transcript is highlighted. When you sign up they give you $60 (US) free credit. I didn’t see any rates quoted but it appears to be $0.02/minute. Output formats are PDF, RTF, and SRT.
SpokenData offers automated transcription with an interactive transcription editor, API, and optional human transcription services. It’s a project of Czech company ReplayWell. Pricing is €0.10/minute down to under €0.05/minute for bulk. The first hour is free. Other services, including speaker segmentation (diarization), are currently free. Transcript formats include SRT, TXT, TRS, XML.
Deepgram also provide an automated transcription service. Pricing is under $0.02/minute. They have a basic transcription viewer and a minimal dashboard. To download a transcript you have to make an API call with a “get_object_transcript” action that’s not currently documented in their rather minimal API documentation. The transcript format is JSON with per-paragraph timings.
Pop Up Archive offers a service that seems an ideal fit for these requirements. You upload a file and they tag, index & transcribe it automatically, including timestamps and speaker diarization. They provide an interactive transcript editor synced to the audio, and team plans allow concurrent editing by multiple people. You can download transcripts in .TXT, .XML, .JSON, .WEBVTT, and .SRT formats, and there’s an API. (Looking at the output it looks like they’re using Speechmatics as the backend transcription service.) They provide a search and browse interface for the thousands of podcast transcripts they’re hosting, plus an HTML code generator for embedding players on your own website. Pricing ranges from $0.25/min down to $0.20/min on monthly plans. One hour free credit.
Pop Up Archive have an interesting project called Audiosear.ch which is billed as “a full–text search and intelligence engine for podcasts and radio”. It includes a ClipMaker feature that makes it easy for anyone to search for and select a favorite podcast moment and share it on social media as a short auto-playing video of the audio and transcript. Take a look and try it out.
Commercial Speech-to-text Applications
These are applications which you install and run on your own machine. Modern machines and software are fast enough for high quality results in realtime. A key feature is the ability to train the software to improve the recognition of a particular voice. This, combined with custom vocabularies, greatly improves the accuracy.
Ignoring companies offering niche products (like vestec, SRI, and verbio) which don’t provide documentation or prices online, there’s only one major player left in this field: Nuance, with their Dragon line of products for PC and Mac.
Dragon can learn your vocabulary and likely phrases by reading documents or emails you’ve written. For transcribing podcasts it could be given some existing transcriptions, if you have any. It will also learn from the corrections you make while dictating. All this training is tied to a single voice profile so Dragon will only work well with a single voice at a time.
It’s also important to note that Dragon, unlike services such as Trint and Speechmatics, will only give you a bare stream of words. There’s no segmentation into sentences and paragraphs. You’ll have to do that by hand, along with capitalizing the first word. So even if Dragon was very accurate you’ll always be left with a lot of work to do.
This thread on Ycombinator from March 2016 includes a variety of opinions, including “As someone who’s worked with a lot of these engines, Nuance and IBM are the only really high quality players in the space”; “If Nuance is 100%, I’d say CMUSphinx is at least 40%”; “As someone who has actually done objective tests, Google are by far the best, Nuance are a clear second. IBM Watson is awful though. Actually the worst I’ve tested.”
Spoiler: in my testing so far Trint.com and Speechmatics.com are much better than Watson and (untrained) Nuance. I’ll post detailed results when I’ve finished testing.
State of the Art
The state of the art in speech recognition is advancing very rapidly at the moment as Deep Learning and other modern machine learning techniques are being applied ever-more successfully. One of the most difficult of all human speech recognition tasks is conversational telephone speech, very similar to the conversational podcast speech we’re exploring here. Recent research, published in October 2016, has shown that it is now possible to achieve human parity in conversational speech recognition. This is a significant research milestone that should be reflected in commercial systems in the future.
Verbatim vs Clean Transcription
Informal speech is often littered with stutters, filler words (‘ah’, ‘um’, ‘like’ etc.), and other forms of speech disfluency. Conversational speech often contains ‘confirmational affirmations’ such as “Uh-huh.”, “I see.”
Commercial transcription services will, by default, provide you with a ‘clean’ transcript that doesn’t include every utterance in the audio. The disfluencies and confirmational affirmations are skipped. A ‘verbatim’ transcription service is often available at a higher cost to account for the work that’s needed to capture the extra details.
Depending on the amount of disfluency, a clean transcript can be significantly easier to read than a verbatim transcript.
An automated transcription system will naturally produce a verbatim transcript. Cleaning up a verbatim transcript automatically is an active area of research. For our purposes in the short term I imagine some typical cases could be recognised and edited out automatically. The rest would have to be dealt with as part of the crowdsourced manual QA process.
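As an example of the kind of typical case that might be edited out automatically, here's a minimal sketch using regular expressions. The list of fillers is an assumption and deliberately short. Words like ‘like’ are left alone because they're usually legitimate, and the repeated-word rule will wrongly collapse intentional repeats such as “that that”. It also doesn't restore capitalisation after removing a filler at the start of a sentence.

```python
import re

# Whole-word fillers, optionally followed by a comma or full stop.
# "uh-huh" is listed first so it is removed as a unit.
FILLERS = re.compile(r"\b(?:uh-huh|um+|uh+|ah+|er+)\b[,.]?\s*", re.IGNORECASE)

# A word immediately repeated one or more times, e.g. "the the".
REPEATED_WORD = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_verbatim(text):
    """Remove simple fillers and stuttered repeats from a verbatim transcript."""
    text = FILLERS.sub("", text)
    text = REPEATED_WORD.sub(r"\1", text)
    # Tidy up any double spaces left behind.
    return re.sub(r"\s{2,}", " ", text).strip()
```

Anything this misses, or gets wrong, would fall to the manual QA process.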
So can a viable automated podcast transcription solution be built from these options?
Dragon applications offer the highest accuracy but only work well with a single voice, don’t provide automatic timecodes, and are hard to automate.
Free automated transcription services offer no training or customisation and don’t provide timecodes directly.
The Watson Developer Cloud Speech to Text service offers timecodes but no training or customisation. It might be workable but is likely to be relatively poor quality, especially without diarisation.
The Nuance Cloud Speech Recognition service would require me to pre-process the audio into small chunks, presumably based on pauses. That would mean I’d effectively generate timecodes myself but at the cost of significant extra audio processing upfront. Quality is bound to suffer, especially in segments where pauses aren’t clear.
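Here's a minimal sketch of pause-based chunking, to show how the timecodes would fall out of the pre-processing. It assumes the audio has already been reduced to a list of per-frame energy values (for example, the average loudness of each 20 ms frame). The function name, the threshold, and the minimum pause length are all assumptions that would need tuning on real audio.

```python
def split_on_pauses(frame_energies, frame_seconds, threshold, min_pause_seconds):
    """Return (start, end) times in seconds for each chunk of speech.

    A chunk ends when the energy stays below `threshold` for at least
    `min_pause_seconds`. Shorter dips are kept inside the chunk.
    """
    min_pause_frames = max(1, round(min_pause_seconds / frame_seconds))
    chunks = []
    chunk_start = None  # frame index where the current chunk began
    last_loud = None    # most recent frame at or above the threshold
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if chunk_start is None:
                chunk_start = i
            last_loud = i
        elif chunk_start is not None and i - last_loud >= min_pause_frames:
            # The pause is long enough: close the chunk at the last loud frame.
            chunks.append((chunk_start * frame_seconds, (last_loud + 1) * frame_seconds))
            chunk_start = None
    if chunk_start is not None:
        # Audio ended while a chunk was still open.
        chunks.append((chunk_start * frame_seconds, (last_loud + 1) * frame_seconds))
    return chunks
```

The start time of each chunk becomes the timecode for whatever text the recognition service returns for it.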
Considering pre-processing the audio opens up extra possibilities. In addition to identifying pauses, I could also implement diarisation (e.g. using one of the open source tools). That would not only improve the chunking, where one speaker starts talking over another, but also open up interesting solutions for the single speaker problem…
Given the details of who is speaking when, a separate audio file for each speaker could be generated, with the voice of the other speaker replaced by silence. (An audio editor that supports a cue sheet would make that simple.) Each per-speaker file could then be fed to Dragon with the appropriate profile for that speaker. After a short period of training the rest of the transcription for that speaker could proceed automatically and with higher accuracy.
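A minimal sketch of that per-speaker split, assuming the audio is available as a plain list of samples and the diarisation output is a list of (speaker, start, end) turns. Both of those layouts, and the function name, are assumptions for illustration. A real implementation would use an audio library rather than Python lists.

```python
def per_speaker_tracks(samples, sample_rate, turns):
    """Build one full-length track per speaker, silent where others talk.

    `turns` is a list of (speaker, start_seconds, end_seconds) tuples.
    Returns a dict mapping each speaker to a list of samples.
    """
    tracks = {}
    for speaker, start, end in turns:
        # Start each speaker's track as silence of the same length.
        track = tracks.setdefault(speaker, [0] * len(samples))
        first = max(0, int(start * sample_rate))
        last = min(len(samples), int(end * sample_rate))
        # Copy only this speaker's turn into their track.
        track[first:last] = samples[first:last]
    return tracks
```

Because each track keeps the original length, positions in a per-speaker file still line up with positions in the original recording.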
That would solve the single speaker problem but there’s still a lack of timecodes in the transcript. A few approaches spring to mind but the most interesting is to insert the timecodes into the audio stream as spoken words, e.g. “zero seven colon one five space”, perhaps using a text-to-speech tool. Then there would be no need to keep the periods of silence for the other speaker. The audio file would dictate its own timecodes!
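Here's a minimal sketch of turning a time into that spoken form, plus the reverse step for recovering the time from the transcribed words. It assumes a minutes:seconds format and that the recogniser returns the digit words exactly as spoken, which is optimistic. The function names are made up for this example.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spoken_timecode(seconds):
    """Spell out a time as words, e.g. 435 -> 'zero seven colon one five space'."""
    minutes, secs = divmod(int(seconds), 60)
    digits = f"{minutes:02d}:{secs:02d}"
    words = ["colon" if ch == ":" else DIGIT_WORDS[int(ch)] for ch in digits]
    return " ".join(words) + " space"


def parse_spoken_timecode(phrase):
    """Convert a transcribed spoken timecode back into seconds.

    Raises ValueError if the phrase contains an unexpected word.
    """
    words = phrase.lower().split()
    if words and words[-1] == "space":
        words = words[:-1]
    text = "".join(":" if w == "colon" else str(DIGIT_WORDS.index(w)) for w in words)
    minutes, secs = text.split(":")
    return int(minutes) * 60 + int(secs)
```

The output of `spoken_timecode` would be fed to a text-to-speech tool and the resulting audio spliced in at the start of each chunk.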
The transcripts generated for each speaker could then be merged using the timecodes in the text to interleave them in the correct order. (Though in practice they’d probably simply be written into a database with the timecode as a key.)
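The interleaving step is simple if each per-speaker transcript is already sorted by time. Here's a minimal sketch using `heapq.merge` from the standard library. The (start_seconds, speaker, text) tuple layout is an assumption for this example.

```python
import heapq


def interleave(*speaker_transcripts):
    """Merge per-speaker transcripts into one list ordered by start time.

    Each transcript is a list of (start_seconds, speaker, text) tuples
    that is already sorted by start_seconds.
    """
    return list(heapq.merge(*speaker_transcripts, key=lambda entry: entry[0]))
```

Usage might look like this:

```python
alice = [(0.0, "Alice", "Hi, how are you?"), (10.0, "Alice", "Glad to hear it.")]
bob = [(5.0, "Bob", "Very well, thanks.")]
for start, speaker, text in interleave(alice, bob):
    print(f"{start:6.1f} {speaker}: {text}")
```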
Slicing the audio per-speaker would also enable a neat solution to the problem of poor quality recordings of interviews where the remote person has a poor internet connection. If they made a separate local recording of their voice then that audio file could be sliced up and used for the transcription of the parts of the interview where they were speaking. Neat!
When you listen to someone you absorb more than when just reading their words. Transcriptions help you search and discover sections of interest, but then it should be easy to listen to the words.
This is why having timecodes is important. Having searched transcripts to discover sections of interest you could click a button to listen to just those parts. For video podcasts you might choose to watch, giving you the added dimension of all the non-verbal communication.
Where do subtitles, and their more feature-full modern cousin, captions, fit in? For the deaf, the hard of hearing, and non-native speakers, they offer the opportunity to read the words in sync with the added richness of the non-verbal communication.
In theory subtitles/captions could be generated directly from a transcript if it has sufficiently frequent and accurate timecodes. Speaker diarisation would also help. That should be enough to generate at least a good quality draft. Which raises the “quality gap” question again: could automatically generated subtitles/captions be made “good enough” that the effort of manual correction is significantly less than the effort of manual creation? I think so.
Note that there will almost always be a need for some manual editing. For example, carefully condensing the number of words to fit within typical reading speeds, or adding captions for sounds, like “[dog barking]”.
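To show how little is needed for a first draft, here's a minimal sketch that writes timecoded, diarised segments out in SRT format. The cue layout, a list of (start_seconds, end_seconds, speaker, text) tuples, is an assumption for this example. It makes no attempt to split long segments or condense wording to fit reading speeds.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues):
    """Render cues as SRT text.

    `cues` is a list of (start_seconds, end_seconds, speaker, text).
    An empty speaker leaves the line unprefixed.
    """
    blocks = []
    for number, (start, end, speaker, text) in enumerate(cues, start=1):
        line = f"{speaker}: {text}" if speaker else text
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{line}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```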
Syncing the timing of the appearance (and disappearance) of each subtitle is a painstaking process that consumes the most time of any portion of the captioning process. Here’s an example video of the manual syncing process.
One way to avoid the effort is to let YouTube perform Set Timings on a plain-text transcript for you (Video of announcement and demo in 2009.) It’s “not recommended for videos that are over an hour long or have poor audio quality”. If that does work well then it would remove the need to generate timecodes myself.
I presume that having a ‘verbatim’ transcript, rather than a ‘clean’ one, would help the YouTube Set Timings processing to be more reliable.
Applications and Services
Wikipedia has a comparison of subtitle editors that provides an incomplete list of free and commercial editors for various platforms. There’s also a list in the “Use captioning software & services” section of the Add subtitles & closed captions YouTube help page.
I’ll just highlight a few interesting ones here:
Voxcribe offer a commercial Windows application called VoxcribeCC that uses speaker-independent speech recognition technology to automatically caption a video. The first 60 minutes is free, then you pay-as-you-go for $7-$10 per hour. Output formats are Subrip (srt) and Timed Text (xml). It doesn’t support training or custom vocabularies.
Amara deserves a special mention: Amara is an open-source and non-profit collaboration community for captioning and subtitling video. A ‘Wikipedia for Subtitles’, Amara enables volunteers to make videos accessible for people who are deaf and hard of hearing and anyone who doesn’t speak the language of the original video. Amara has more than 100,000 subtitling volunteers and organizations like TED, Khan Academy, and PBS use it to make video accessible.
Amara is a project of the Participatory Culture Foundation. (YouTube also supports community-contributed subtitles directly, along with paid translations.) The open-source code is being actively developed and includes a rich API.
Here’s an outline for one (of many) possible workflows:
- Generate a verbatim transcript from the audio.
- Generate and upload a transcript file formatted for YouTube.
- Request YouTube to perform a Set Timings operation.
- Download the subtitles and timecode data.
- Clean up the verbatim transcript.
- Combine with the speaker diarisation data, if available.
- Generate the interactive transcript pages.
- Condense subtitle wording to fit reading speed, if needed.
- Upload condensed and diarised subtitles back to YouTube.
Proof of Concept Testing
I have mentioned lots of services in this post. Next I’m planning to do some very basic testing of the ones that seem likely to be useful. I’ll use some podcast audio for which I also have manual transcriptions. I want to get some experience with the various tools from the low-end (speaker independent) through to the high-end (Dragon with vocabulary and voice training). That will give me some sense of how big the “quality gap” really is. I’ll post some results when I have them.
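One common way to put a number on the “quality gap” is word error rate: the number of word substitutions, insertions and deletions needed to turn the automated transcript into the manual one, divided by the number of words in the manual one. I'm not committing to this as the measure I'll use, and this minimal sketch ignores punctuation and treats the manual transcript as perfectly correct, which it may not be.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    `reference` is the manual transcript and `hypothesis` is the
    automated one. Comparison is case-insensitive and splits on whitespace.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

A score of 0 means the two transcripts match word for word. Scores above 1 are possible if the automated transcript contains many extra words.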
I’ve written a follow-up post about how I’m Comparing Transcriptions – which turned out to be more tricky, and interesting, than I’d expected.
Naturally I’m glossing over lots of details here, and I know there’s lots I don’t know. At this stage I’m very much in exploratory mode, discovering possibilities to see what might be viable. I’m encouraged by what I’ve found so far and can see interesting paths worth exploring.
I have no particular experience with audio processing or bulk transcription, but I am interested in helping more podcasts to have rich searchable transcripts available.
Are you? Great! Please get in touch.
Appendix of Random Notes
Some of the most common Subtitle and Caption File Formats are:
- SRT – “SubRip Text” – a standard subtitle format supported by most video players
- SSA – “SubStation Alpha” format that allows more advanced subtitles than the conventional SRT format
- TTML – “Timed Text Markup Language”, an XML format that is one of W3C’s standards regulating timed text
- DFXP – “Distribution Format Exchange Profile”, the old name for TTML
- SBV – “SubViewer” plain text format, similar to SRT. Also known as .SUB
- VTT – “Web Video Text Tracks”, very similar to SubRip, supported by most browsers
YouTube supports many subtitle and caption file formats.
- 2016-04-23: Added Speechmatics, NowTranscribe, Google Cloud Speech API, Microsoft Bing Speech API, and the Anecdotal Accuracy section.
- 2016-11-08: Updated IBM Watson entry to note that support for custom vocabularies was added in September 2016.
- 2016-11-22: Added “State of the Art” section with a link to the recent Achieving human parity in conversational speech recognition paper.
- 2016-11-27: Added voicebase with details of the keyword extraction UI.
- 2016-12-03: Updated Google Speech API details. Added a link to a talk where Speechmatics claim to be the world’s most accurate. Some other minor edits.
- 2016-12-30: Added details for SpokenData, Deepgram, Trint, Spreza and Voyz.es.
- 2017-02-01: Added Pop Up Archive and AudioSear.ch. Plus a note on Dragon pointing out that there’s no segmentation into sentences.
- 2017-02-09: Added link to the Comparing Transcriptions follow-up post.
- 2018-04-03: Added link to Maui topic-extraction software, thanks to Rob Wilkinson.