Semi-automated podcast transcription

The medium of podcasting continues to grow in popularity. Americans, for example, now listen to over 21 million hours of podcasts per day. Few of those podcasts have transcripts available, so the content isn’t discoverable, searchable, linkable, reusable. It’s lost.

The typical solution is to pay a commercial transcription service, which charges roughly $1/minute and claims around 98% accuracy. For a podcast producing an hour of content a week, that would add an overhead of around $250 a month. A back catalogue of a year of podcasts would cost over $3,100 to transcribe.

When I remember fragments of some story or idea that I recall hearing on a podcast, I’d like to be able to find it again. Without searchable transcripts I can’t. It’s impractical to listen to hundreds of old episodes, so the content is effectively lost.

Given the advances in automated speech recognition in recent years, I began to wonder if some kind of automated transcription system would be practical. This led on to some thinking about interesting user interfaces.

This (long) post is a record of my research and ponderings around this topic. I sketch out some goals, constraints, and a rough outline of what I’m thinking of, along with links to many tools, projects, and references to information that might help. I’ve also been updating it as I’ve come across extra information and new services.

I’m hoping someone will tell me that such a system, or parts of it, already exist so that I can contribute to those existing projects. If not then I’m interested in starting a new project – or projects – and would welcome any help. Read on if you’re interested…

My Goals

Here is an outline of functionality that I’d like from a basic automated system:

  1. Produce podcast transcripts as plain text on static web pages that are indexed by search engines.
  2. Provide anchors to make it easy for people to link to a particular section, or sections, in the transcript.
  3. Provide buttons to play the audio/video from that point. This requires the transcription to have timecode data.
  4. Identify and show who is speaking, e.g. via speaker diarisation.
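
To make those goals concrete, here's a minimal sketch of how a transcript segment might be rendered as static HTML with a linkable anchor and a play button carrying the timecode. The `Segment` type, the anchor scheme, and the `data-start` attribute are all my own illustrative choices, not an existing system.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float      # seconds from the beginning of the episode
    speaker: str
    text: str

def anchor_id(seg: Segment) -> str:
    # A stable, human-readable anchor, e.g. 435 seconds -> "t-07-15"
    m, s = divmod(int(seg.start), 60)
    return f"t-{m:02d}-{s:02d}"

def render(seg: Segment) -> str:
    # Each paragraph gets an id to link to, and a play button whose
    # data-start attribute a JavaScript player could pick up.
    return (
        f'<p id="{anchor_id(seg)}">'
        f'<button data-start="{seg.start}">&#9658;</button> '
        f'<b>{seg.speaker}:</b> {seg.text}</p>'
    )
```

A link such as `episode-12.html#t-07-15` would then jump straight to that paragraph.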

Of course, an automated transcription is likely to have errors. Perhaps many. For a popular podcast there are likely to be some members of the audience (perhaps many) who are willing to contribute some amount of time to checking and correcting errors, somewhat like Wikipedia. A low-friction user experience makes that more likely.

In other words, crowdsourcing of error checking and correction may be a viable way to close the “quality gap” between manual and automated transcriptions. At this point I’ve no idea how big that gap will be, though I’m confident it can be made small enough for this whole endeavour to be worthwhile. (I’m assuming that the podcasts will have clear high-quality audio.)

I have explored the options for transcription in more detail below.

Beyond the basic transcription, presentation, and editing features there are many interesting possibilities for future enhancements.

Natural language processing

Automated natural language processing is becoming a lot more powerful and could be used to enrich the transcript with extra information. For example:

  • Using keyword extraction to automatically identify suitable keywords for indexing, to aid search and discovery. Also entity extraction to identify the names of things, such as people, companies, or locations.
  • Identification of topic segments within a podcast is much more difficult, but also more useful. This is an interesting area of research, e.g. Maui (software). I’d like to support overlapping segments to cover both high-level themes and the specifics within them.
  • The keyword extraction could then be applied to individual segments, as well as whole podcasts, to aid finer-grained indexing.
  • Some kind of classification of topics into, or with, a taxonomy might also be helpful for someone exploring a large topic space.
  • Generate automatic summaries of segments. The summaries for all the segments would form a summary of the episode.

Those would open up alternative ways to search and explore a collection of podcasts. You’d be able to easily read or listen to all the segments that touch on a given topic across many episodes, perhaps stitching them into a thread or ‘playlist’ you can share with others.

There are also more immediate, practical problems such as recognising the boundary between sentences and fixing the casing of words. These aren’t critical but would significantly reduce the error checking and correction required to create a high quality transcript.
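
As a taste of how far simple heuristics can go, here's a rough sketch that capitalises the first word and any word following sentence-final punctuation. It's deliberately naive; proper sentence-boundary detection is an NLP problem in its own right.

```python
import re

def fix_casing(text: str) -> str:
    # Capitalise the first word and any word following sentence-final
    # punctuation. A crude heuristic, not real NLP: it will get
    # abbreviations like "e.g." wrong, for instance.
    def cap(match: re.Match) -> str:
        return match.group(1) + match.group(2).upper()
    text = text[:1].upper() + text[1:]
    return re.sub(r'([.?!]\s+)([a-z])', cap, text)
```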

Database Storage

It should be clear by now that the underlying transcript data will need to be stored in some kind of database where it can be augmented with timecodes, speakers, segment details, keywords etc.

The database would also support user interfaces for error checking and correction, fine-tuning segments, and keywords etc.

From there the transcripts could be output in a variety of forms, from static web pages to rich interactive tools for exploration and sharing.
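
For illustration, here's a minimal sketch of what that storage might look like using SQLite; the table and column names are hypothetical.

```python
import sqlite3

# A minimal storage-layer sketch, assuming one row per transcript
# segment keyed by episode and timecode. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE episode (
    id      INTEGER PRIMARY KEY,
    title   TEXT NOT NULL
);
CREATE TABLE segment (
    episode_id  INTEGER REFERENCES episode(id),
    start_secs  REAL NOT NULL,   -- timecode into the audio
    speaker     TEXT,            -- from diarisation, may be NULL
    keywords    TEXT,            -- e.g. comma-separated, from NLP
    body        TEXT NOT NULL,
    PRIMARY KEY (episode_id, start_secs)
);
""")
```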

Full Text Search

Web search engines like Google and Bing are very good at what they do. Yet they are very general tools, trying to do the best they can for all the web pages on the internet. There are better tools for specific jobs.

One that I’m familiar with is Elasticsearch which has a rich set of features for dealing with human language and powerful full-text search capabilities. Beyond its general capabilities, it can be taught synonyms specific to the topics covered by the podcast. This would significantly improve the quality of search results.
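
As a hedged sketch, an index definition along these lines would teach Elasticsearch podcast-specific synonyms via its synonym token filter. The synonym list itself is purely illustrative.

```python
# Index settings sketch: a custom analyzer with a synonym filter so
# that podcast-specific terms match each other at search time.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "podcast_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "asr, speech recognition, speech-to-text",
                        "diarisation, diarization, speaker labelling",
                    ],
                }
            },
            "analyzer": {
                "transcript_text": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "podcast_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "transcript_text"}
        }
    },
}
```

With this in place, a search for “speech-to-text” would also match segments that only say “ASR”.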

Video Subtitles/Captions

I generally listen to podcasts as audio, while driving or resting, even those that are available as videos. I hadn’t given any thought to subtitles as another output format until I started researching what transcription tools, projects and services already existed. I’ll talk more about it below.


Here’s a simple schematic, for what it’s worth: Transcription Data Flow Schematic

What’s Out There

Applications that Facilitate Manual Transcription

These tools typically provide a user-interface that combines a media player with a text editor. You play the media and start typing what you hear (as fast as you can), pause, rewind a bit, repeat.

Here is a selection for reference, in no particular order:

If you’re performing manual transcription at the moment, especially with a standard word processor, I’d urge you to try some of these. They may smooth out the process in many small ways that accumulate to save you a lot of time and effort.

When performing manual transcription it obviously helps to be able to type fast, ideally fast enough to keep up with the speakers. Approximate words per minute rates are around 150–200 for typical podcast speakers, and 40–80 for average-to-good typists. That difference creates a problem.
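
A quick back-of-the-envelope calculation shows the size of that gap:

```python
# Rough arithmetic: how long does it take to type out one hour of
# speech without slowing the audio down? Rates are midpoints of the
# ranges above.
speech_wpm = 160      # typical conversational speaking rate
typing_wpm = 60       # an average-to-good typist

words_per_hour = speech_wpm * 60            # 9,600 words of audio
typing_minutes = words_per_hour / typing_wpm
```

That's around 160 minutes of solid typing per hour of audio, before any rewinding, pausing, or correcting.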

Users of Dvorak keyboards often report significantly faster typing speeds. For maximum speed you might be interested in the Open Stenography Project.

Very few transcribers can keep up with typical speakers. The usual solution is to use a foot pedal to rewind the media by a few seconds whenever needed, that way your fingers can stay on the home row of the keyboard. Yet every time you rewind there’s a break in your flow and productivity falls.

An alternative approach is to slow the media playback down to match a comfortable typing rate. This can be done with audio time-scale/pitch modification techniques such as PSOLA, which can change the speed without altering the pitch. Most of the tools I’ve listed above support variable speed playback, but only a few explicitly mention maintaining the correct pitch. The free web-based Scribie transcription editor seems particularly good at this.

Commercial Transcription Services

These provide a service where you upload an audio or video file and get back a file containing the transcription. You’re paying some amount of money for someone to use an application (like those above) on your behalf, plus some level of quality checking. I’ll only list here a few services that provide timecoded transcriptions, including subtitling services.

At the very high-end, 3play Media are a traditional transcription service provider offering “Premium quality with +99% accuracy” for prices ranging from $2 to $3 per minute. They provide an API for upload/download.

At the very low-end, if you’re willing to handle the management of the work then Fiverr have a number of people offering transcription services for $5 (typically for 10 to 20 minutes of transcription). Your mileage will vary.

In the innovative-middle-ground, Scribie guarantee +98% accuracy, offer prices down to $0.70/min for 20-30 day turnaround, and include time-coding. There’s an additional charge of $1.00/minute for producing subtitles (SBV/SRT). They provide an API and have an interesting blog. They also make their own transcription editor web application freely available for anyone to use. I like their technology and ‘managed crowdsourcing’ approach.

Commercial Transcription Services (behind the scenes)

Speaking of crowdsourcing, while researching this post I came across CrowdSurfWork. This site is an interface for freelance transcribers to work on “micro-tasks” related to transcription. Their system is built on Amazon’s Mechanical Turk service, which provides a marketplace for “Human Intelligence Tasks”. Typical micro-tasks include transcribing a chunk of audio (“up to 35 seconds”), reviewing and scoring a chunk of transcript, quality checking a whole transcript etc. CrowdSurfWork don’t say who their clients are. They’re certainly not the only ones using Mechanical Turk for transcription work.

Commercial services provide a complete transcription service: audio in, high quality transcript out. Internally that work is usually broken down into a transcription phase and a quality check/edit phase. I wonder if some companies could offer a service that takes a raw initial transcript (e.g. generated by an automated transcription system) and just perform the quality check/edit phase, at a lower cost.

It seems very likely that some companies are already using automated transcription systems, especially for regular clients where the system could be trained for the client’s voice.

Free Automated Transcription

Automatic speech recognition has come a long way in recent years, with untrained speaker-independent systems achieving useful levels of accuracy.

Google Docs now supports Voice typing which you can use to transcribe your voice, or other audio being played at the time. It only works in the Chrome browser, or the Docs app on iOS or Android. Here’s a demo. (See also Speechlogger which uses the same underlying Google technology and has some handy tips on improving the quality when transcribing audio files by using a “virtual line-in cable”. See also Loopback for Mac.)

Another relevant way to access Google’s speaker-independent speech recognition is to upload a video to YouTube and let it provide Automatic Captioning for you. More on that below.

On a Mac you can use your voice to enter text into almost any application. The default mode uses a web service but you can enable Enhanced Dictation which installs the recognition code locally so you don’t need an internet connection and can “dictate continuously”.

These don’t offer any customisation or training to improve the accuracy.

Microsoft Windows offers a similar Speech Recognition service. It supports a customisable speech dictionary and accuracy improves with usage. As far as I can tell this is built in to the operating system and doesn’t use a network service.

There are a number of speech recognition projects for Linux. I have not looked into them in detail. If you have experience with any that would fit this project I’d be grateful if you would get in touch with me.

Commercial Speech-to-text Services

The Google Cloud Speech API offers access to APIs for applications to “see, hear and translate”. It’s based on the same neural network tech that powers Google’s voice search in the Google app and voice typing in Google’s Keyboard and Chrome described above. It offers some customization in the form of a list of phrases (up to 500, provided with the API request) that act as “hints” to the speech recognizer to favor specific words and phrases in the results. The current limits cap audio length at 80 minutes and require use of uncompressed audio.
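
A sketch of what a request with phrase hints might look like. The field names follow my reading of Google's documentation at the time of writing, so treat them as assumptions rather than a definitive client.

```python
def build_request(gcs_uri, hint_phrases):
    # Build the JSON body for a recognition request with "hint"
    # phrases. The 500-phrase cap and the speechContexts/audio field
    # names are per my understanding of Google's docs.
    if len(hint_phrases) > 500:
        raise ValueError("at most 500 hint phrases are allowed")
    return {
        "config": {
            "languageCode": "en-US",
            "speechContexts": [{"phrases": hint_phrases}],
        },
        "audio": {"uri": gcs_uri},
    }
```

The hint list could be seeded from podcast-specific vocabulary: guest names, recurring jargon, product names and so on.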

Nuance, who currently provide the technology behind Apple’s Siri and dictation services, offer a HTTP REST Cloud speech recognition service that’s targeted at mobile devices. (I presume this is the service behind their new and expensive, Dragon Anywhere mobile dictation app.)

The service supports uploading custom phrases and vocabularies. It also allows you to specify an ID for the speaker which is used for Speaker-Dependent Acoustic Model Adaptation (SD-AMA). This “creates adapted acoustic model profiles from audio collected from each user to improve recognition performance over time.” Both of these should help improve accuracy beyond what’s possible with speaker-independent services like those from Google or Apple.

The pricing is $0.008/transaction, where a ‘transaction’ is a successful HTTP request, presumably about a sentence (I’ve seen references to 30 seconds as a maximum). Their terms require ‘Emerald Level’ payment when the client isn’t a mobile device. Some negotiation might be required!

Microsoft provide a Bing Speech API. The REST API only supports 10 seconds of audio per request, similar to Nuance ‘transactions’ described above. Their Client Library supports streaming.

IBM offers their Watson Developer Cloud Speech to Text service. It has both HTTP REST and WebSocket APIs. The pricing is free for the first thousand minutes per month, then $0.02 per minute. The IBM service doesn’t support SD-AMA; support for custom vocabularies was added in September 2016. (They’ve said they’re working on speaker diarization.) The results include timestamps, confidence indicators and alternative suggestions. Here’s an example use to translate a ProPublica podcast.
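
To illustrate the shape of those results, here's a sketch that flattens a Watson-style response into word timings. The trimmed sample response is illustrative, modelled on my understanding of IBM's documented format.

```python
def word_timings(response):
    # Flatten Watson-style results into (word, start, end) tuples.
    # Assumes each result's best alternative carries a "timestamps"
    # list of [word, start_secs, end_secs] entries.
    words = []
    for result in response.get("results", []):
        best = result["alternatives"][0]
        for word, start, end in best.get("timestamps", []):
            words.append((word, start, end))
    return words

# A trimmed sample response, for illustration only:
sample = {
    "results": [
        {"alternatives": [{
            "transcript": "hello world ",
            "confidence": 0.92,
            "timestamps": [["hello", 0.0, 0.4], ["world", 0.4, 0.9]],
        }], "final": True}
    ]
}
```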

Vocapia provide a Speech to Text API service called VoxSigma. It returns “XML with speaker diarization, language identification tags, word transcription, punctuation, confidence measures, numeral entities and other specific entities”.  They also support customization in the form of ‘Language Model Adaptation’ by uploading sample text. I’ve requested technical documentation and pricing details, neither of which are on their web site. They’ve given me a trial account to test the service.

Speechmatics provide speech to text services with a simple REST API. The transcript data includes speaker diarization, word transcription, punctuation, and confidence measures. They don’t offer any customization. Pricing is £0.06/minute (£3.60/hour), with the first hour free. Speechmatics claim to be the world’s most accurate transcription service.

Voicebase provide a transcription service. They’re using a different version of Speechmatics technology (with slightly lower accuracy it seems). I’m including them here because they provide interesting keyword extraction features. From a two hour interview I uploaded they extracted 94 keywords (like “ecological limits”, “symbolic language” etc.) and grouped them under 170 headings (like “Bioethics”, “Ontology” etc.). Clicking on a keyword or group, or entering search terms manually, shows all the places in the audio timeline where the topic is spoken about. You can then easily listen to just those parts. As you do the relevant portion of the transcript is highlighted. When you sign up they give you $60 (US) free credit. I didn’t see any rates quoted but it appears to be $0.02/minute. Output formats are PDF, RTF, and SRT.

SpokenData offers automated transcription with an interactive transcription editor, API, and optional human transcription services. It’s a project of Czech company ReplayWell. Pricing is €0.10/minute down to under €0.05/minute for bulk. The first hour is free. Other services, including speaker segmentation (diarization), are currently free. Transcript formats include SRT, TXT, TRS, XML.

Deepgram also provide an automated transcription service. Pricing is under $0.02/minute. They have a basic transcription viewer and a minimal dashboard. To download a transcript you have to make an API call with a “get_object_transcript” action that’s not currently documented in their rather minimal API documentation. The transcript format is JSON with per-paragraph timings.

Trint don’t yet have an API for their automated transcription service, but they do have a nice interactive editor with pitch-corrected speed control. Pricing is $0.25-$0.20/minute. Trint “automatically identifies different speakers and segments them into separate paragraphs” (emphasis mine). That doesn’t seem quite right. The transcript is segmented into paragraphs but there’s no identification of speakers that I can find. The editor lets you label the speaker for each paragraph, but you still need to do that manually for every single paragraph. Transcript formats are DOCX, SRT, VTT or “Interactive Transcript”, which is a zip containing HTML and JavaScript. So there’s no pure-data transcript format available. (The “Interactive Transcript” zip contains the transcript in the form of HTML with a span with attributes for each word.) Review.

Pop Up Archive offers a service that seems an ideal fit for these requirements. You upload a file and they tag, index & transcribe it automatically, including timestamps and speaker diarization. They provide an interactive transcript editor synced to the audio, team plans allow concurrent editing by multiple people. Download transcripts in .TXT, .XML, .JSON, .WEBVTT, and .SRT formats, and there’s an API. (Looking at the output it looks like they’re using Speechmatics as the backend transcription service.) They provide a search and browse interface for the thousands of podcast transcripts they’re hosting, plus a HTML code generator for embedding players on your own website. Pricing ranges from $0.25/min down to $0.20/min on monthly plans. One hour free credit.

Pop Up Archive have an interesting project billed as “a full-text search and intelligence engine for podcasts and radio”. It includes a ClipMaker feature that makes it easy for anyone to search for and select a favorite podcast moment and share it on social media as a short auto-playing video of the audio and transcript. Take a look and try it out.

Spreza and one other provider are also operating in this space. Both are currently in private beta; I’ve applied for access.

In November 2017, almost two years after originally writing this post, Amazon launched their Amazon Transcribe service, which adds inferred punctuation, word-level timestamps, and recognises multiple speakers.

See also Pop Up Podcasting’s review of automated transcription tools.

Commercial Speech-to-text Applications

These are applications which you install and run on your own machine. Modern machines and software are fast enough for high quality results in realtime. A key feature is the ability to train the software to improve the recognition of a particular voice. This, combined with custom vocabularies, greatly improves the accuracy.

Ignoring companies offering niche products (like vestec, SRI, and verbio) which don’t provide documentation or prices online, there’s only one major player left in this field: Nuance, with their Dragon line of products for PC and Mac.

Dragon can learn your vocabulary and likely phrases by reading documents or emails you’ve written. For transcribing podcasts it could be given some existing transcriptions, if you have any. It will also learn from the corrections you make while dictating. All this training is tied to a single voice profile so Dragon will only work well with a single voice at a time.

It’s also important to note that Dragon, unlike services such as Trint and Speechmatics, will only give you a bare stream of words. There’s no segmentation into sentences and paragraphs. You’ll have to do that by hand, along with capitalizing the first word of each sentence. So even if Dragon were very accurate you’d always be left with a lot of work to do.

Anecdotal Accuracy

This thread on Ycombinator from March 2016 includes a variety of opinions, including “As someone who’s worked with a lot of these engines, Nuance and IBM are the only really high quality players in the space”; “If Nuance is 100%, I’d say CMUSphinx is at least 40%”; “As someone who has actually done objective tests, Google are by far the best, Nuance are a clear second. IBM Watson is awful though. Actually the worst I’ve tested.”

Spoiler: in my testing so far the leading services are much better than Watson and (untrained) Nuance. I’ll post detailed results when I’ve finished testing.

State of the Art

The state of the art in speech recognition is advancing very rapidly at the moment as Deep Learning and other modern machine learning techniques are being applied ever-more successfully. One of the most difficult of all human speech recognition tasks is conversational telephone speech, very similar to the conversational podcast speech we’re exploring here. Recent research, published in October 2016, has shown that it is now possible to achieve human parity in conversational speech recognition, a significant research milestone that should be reflected in commercial systems in the future.

Verbatim vs Clean Transcription

Informal speech is often littered with stutters, filler words (‘ah’, ‘um’, ‘like’ etc.), and other forms of speech disfluency. Conversational speech often contains ‘confirmational affirmations’ such as “Uh-huh.” and “I see.”

Commercial transcription services will, by default, provide you with a ‘clean’ transcript that doesn’t include every utterance in the audio. The disfluencies and confirmational affirmations are skipped. A ‘verbatim’ transcription service is often available at a higher cost to account for the work that’s needed to capture the extra details.

Depending on the amount of disfluency, a clean transcript can be significantly easier to read than a verbatim transcript.

An automated transcription system will naturally produce a verbatim transcript. Cleaning up a verbatim transcript automatically is an active area of research. For our purposes in the short term I imagine some typical cases could be recognised and edited out automatically. The rest would have to be dealt with as part of the crowdsourced manual QA process.
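
A crude first pass over the easy cases might look like this sketch; anything it misses would fall through to the manual QA.

```python
import re

# Strip common standalone filler words. "like" is deliberately left
# alone since it is too often meaningful. This only handles the easy
# cases; trickier disfluencies are left for manual correction.
FILLERS = re.compile(r'\b(?:um+|uh+|ah+|er+|hmm+)\b[,.]?\s*',
                     re.IGNORECASE)

def clean(verbatim: str) -> str:
    return FILLERS.sub('', verbatim).strip()
```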

Podcast Transcription

So can a viable automated podcast transcription solution be built from these options?

Dragon applications offer the highest accuracy but only work well with a single voice, don’t provide automatic timecodes, and are hard to automate.

Free automated transcription services offer no training or customisation and don’t provide timecodes directly.

The Watson Developer Cloud Speech to Text service offers timecodes but no training or customisation. It might be workable but is likely to be relatively poor quality, especially without diarisation.

The Nuance Cloud Speech Recognition service would require me to pre-process the audio into small chunks, presumably based on pauses. That would mean I’d effectively generate timecodes myself but at the cost of significant extra audio processing upfront. Quality is bound to suffer, especially in segments where pauses aren’t clear.

Considering pre-processing the audio opens up extra possibilities. In addition to identifying pauses, I could also implement diarisation (e.g. using one of the open source tools). That would not only improve the chunking, where one speaker starts talking over another, but also open up interesting solutions for the single speaker problem…

Given the details of who is speaking when, a separate audio file for each speaker could be generated, with the voice of the other speaker replaced by silence. (An audio editor that supports a cue sheet would make that simple.) Each per-speaker file could then be fed to Dragon with the appropriate profile for that speaker. After a short period of training the rest of the transcription for that speaker could proceed automatically and with higher accuracy.
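
The silencing plan falls out directly from the diarisation output. A sketch, assuming diarisation yields (start, end, speaker) tuples:

```python
def silence_spans(segments, duration, speaker):
    # Return the spans NOT spoken by `speaker`, i.e. the parts to
    # overwrite with silence in that speaker's copy of the audio.
    keep = sorted((s, e) for s, e, who in segments if who == speaker)
    silences, cursor = [], 0.0
    for s, e in keep:
        if s > cursor:
            silences.append((cursor, s))
        cursor = max(cursor, e)
    if cursor < duration:
        silences.append((cursor, duration))
    return silences
```

The resulting spans could then be written out as a cue sheet for whatever audio editor does the actual silencing.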

That would solve the single speaker problem but there’s still a lack of timecodes in the transcript. A few approaches spring to mind but the most interesting is to insert the timecodes into the audio stream as spoken words, e.g. “zero seven colon one five space”, perhaps using a text-to-speech tool. Then there would be no need to keep the periods of silence for the other speaker. The audio file would dictate its own timecodes!
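
Generating those spoken timecodes, and parsing them back out of the recognised text, is straightforward. A sketch of the round trip:

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def speak_timecode(secs: int) -> str:
    # Render mm:ss digit by digit, e.g. 435 -> "zero seven colon one five",
    # ready to be fed to a text-to-speech tool.
    m, s = divmod(secs, 60)
    words = [DIGITS[int(d)] for d in f"{m:02d}"]
    words.append("colon")
    words += [DIGITS[int(d)] for d in f"{s:02d}"]
    return " ".join(words)

def parse_timecode(words: str) -> int:
    # Invert the rendering when it shows up in the transcript.
    left, right = words.split(" colon ")
    def to_num(part):
        return int("".join(str(DIGITS.index(w)) for w in part.split()))
    return to_num(left) * 60 + to_num(right)
```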

The transcripts generated for each speaker could then be merged using the timecodes in the text to interleave them in the correct order. (Though in practice they’d probably simply be written into a database with the timecode as a key.)

Slicing the audio per-speaker would also enable a neat solution to the problem of poor quality recordings of interviews where the remote person has a poor internet connection. If they made a separate local recording of their voice then that audio file could be sliced up and used for the transcription of the parts of the interview where they were speaking. Neat!

Video Subtitles/Captions

When you listen to someone you absorb more than when just reading their words. Transcriptions help you search and discover sections of interest, but then it should be easy to listen to the words.

This is why having timecodes is important. Having searched transcripts to discover sections of interest you could click a button to listen to just those parts. For video podcasts you might choose to watch, giving you the added dimension of all the non-verbal communication.

Where do subtitles, and their more feature-full modern cousin, captions, fit in? For the deaf, the hard of hearing, and non-native speakers, they offer the opportunity to read the words in sync with the added richness of the non-verbal communication.

In theory subtitles/captions could be generated directly from a transcript if it has sufficiently frequent and accurate timecodes. Speaker diarisation would also help. That should be enough to generate at least a good quality draft. Which raises the “quality gap” question again: could automatically generated subtitles/captions be made “good enough” that the effort of manual correction is significantly less than the effort of manual creation? I think so.
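
Generating a draft SRT file from timecoded, diarised segments is then mostly a formatting exercise. A sketch:

```python
def srt_time(secs: float) -> str:
    # SRT uses HH:MM:SS,mmm with a comma before the milliseconds.
    ms = int(round(secs * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, speaker, text):
    # One draft cue per transcript segment. Manual editing would
    # still condense the wording to fit comfortable reading speeds.
    line = f"{speaker}: {text}" if speaker else text
    return f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{line}\n"
```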

Note that there will almost always be a need for some manual editing. For example, carefully condensing the number of words to fit within typical reading speeds, or adding captions for sounds, like “[dog barking]”.

Syncing the timing of the appearance (and disappearance) of each subtitle is a painstaking process that consumes the most time of any portion of the captioning process. Here’s an example video of the manual syncing process.

One way to avoid the effort is to let YouTube perform Set Timings on a plain-text transcript for you (Video of announcement and demo in 2009.) It’s “not recommended for videos that are over an hour long or have poor audio quality”. If that does work well then it would remove the need to generate timecodes myself.

I presume that having a ‘verbatim’ transcript, rather than a ‘clean’ one, would help the YouTube Set Timings processing to be more reliable.

Applications and Services

Wikipedia has a comparison of subtitle editors that provides an incomplete list of free and commercial editors for various platforms. There’s also a list in the “Use captioning software & services” section of the Add subtitles & closed captions YouTube help page.

I’ll just highlight a few interesting ones here:

Voxcribe offer a commercial Windows application called VoxcribeCC that uses speaker-independent speech recognition technology to automatically caption a video. The first 60 minutes is free, then you pay-as-you-go for $7-$10 per hour. Output formats are Subrip (srt) and Timed Text (xml). It doesn’t support training or custom vocabularies.

Amara deserves a special mention: Amara is an open-source and non-profit collaboration community for captioning and subtitling video. A ‘Wikipedia for Subtitles’, Amara enables volunteers to make videos accessible for people who are deaf and hard of hearing and anyone who doesn’t speak the language of the original video. Amara has more than 100,000 subtitling volunteers and organizations like TED, Khan Academy, and PBS use it to make video accessible.

Amara is a project of the Participatory Culture Foundation. (YouTube also supports community-contributed subtitles directly, along with paid translations.) The open-source code is being actively developed and includes a rich API.


Here’s an outline for one (of many) possible workflows:

  • Generate a verbatim transcript from the audio.
  • Generate and upload a transcript file formatted for YouTube.
  • Request YouTube to perform a Set Timings operation.
  • Download the subtitles and timecode data.
  • Clean up the verbatim transcript.
  • Combine with the speaker diarisation data, if available.
  • Generate the interactive transcript pages.
  • Condense subtitle wording to fit reading speed, if needed.
  • Upload condensed and diarised subtitles back to YouTube.
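
The workflow above naturally decomposes into a pipeline where each step consumes the previous step's output. A trivial sketch of the plumbing, with every function name hypothetical:

```python
def run_pipeline(artefact, steps):
    # Thread an artefact (audio path, transcript, subtitle file...)
    # through a sequence of processing steps, in order.
    for step in steps:
        artefact = step(artefact)
    return artefact
```

In practice each step would be one of the tools or services discussed above: transcription, Set Timings, clean-up, diarisation merge, page generation, and so on.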

What Next?

Proof of Concept Testing

I have mentioned lots of services in this post. Next I’m planning to do some very basic testing of the ones that seem likely to be useful. I’ll use some podcast audio for which I also have manual transcriptions. I want to get some experience with the various tools from the low-end (speaker independent) through to the high-end (Dragon with vocabulary and voice training). That will give me some sense of how big the “quality gap” really is. I’ll post some results when I have them.

I’ve written a follow-up post about how I’m Comparing Transcriptions – which turned out to be more tricky, and interesting, than I’d expected.

A Project?

Naturally I’m glossing over lots of details here, and I know there’s lots I don’t know. At this stage I’m very much in exploratory mode, discovering possibilities to see what might be viable. I’m encouraged by what I’ve found so far and can see interesting paths worth exploring.

I have no particular experience with audio processing or bulk transcription, but I am interested in helping more podcasts to have rich searchable transcripts available.

Are you? Great! Please get in touch.

Appendix of Random Notes

Some of the most common Subtitle and Caption File Formats are:

  • SRT – “SubRip Text” – a standard subtitle format supported by most video players
  • SSA – “SubStation Alpha” format that allows more advanced subtitles than the conventional SRT format
  • TTML – “Timed Text Markup Language”, an XML format that is one of W3C’s standards regulating timed text
  • DFXP – “Distribution Format Exchange Profile”, the old name for TTML
  • SBV – “SubViewer” plain text format, similar to SRT. Also known as .SUB
  • VTT – “Web Video Text Tracks”, very similar to SubRip, supported by most browsers
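
Conversion between the simpler formats is often mechanical. For example, a minimal (and deliberately naive) SRT-to-VTT sketch:

```python
import re

def srt_to_vtt(srt_text: str) -> str:
    # The key differences: VTT starts with a "WEBVTT" header and uses
    # a dot rather than a comma before the milliseconds. This ignores
    # VTT-only niceties like cue settings and styling.
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body
```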

YouTube supports many subtitle and caption file formats.


39 thoughts on “Semi-automated podcast transcription”

  1. YES! I was just thinking about this. Would love if there would be some automated or semi-automated ways to transcribe podcasts / conference talks. It would be even better if the resulting data-set would be openly available under a license ensuring that it can be re-used and built upon.

    FYI, Google just opened up their speech recognition API. There is some good discussion about it (and alternatives) on HN:

    • Hi. Thanks. Interesting discussion on that thread. My understanding is that it’s a more formal interface to the same underlying service that’s described in the blog post. If so, then it would have the same pros and cons (no vocabulary customization or speaker profile), albeit simpler for another software to access. I’ll aim to include it in my testing.

  2. Excellent, in-depth consideration of the many facets of producing text from others’ speeches. My particular interest is to capture something like 85-90% of the text, so I can ABSTRACT the contents of the audio (with timestamps) and share podcasts/videos in a way that captures some essential outline, so folks I send it to can decide WHAT PART to listen to in detail.
    So – maybe iteratively, I’ll approach the proposed SCHEMATIC workflow. Will report if I make progress…

    • Hi Jerry. Our goals certainly appear to overlap. I’ve had limited spare time to pursue this, though it’s still very much on my mind. Please do keep in touch.

      • Popping in (I’m also on the project). We’ve done a few more experiments since then. These days, we’re using Watson for the initial transcription, and then a “crowdsourced” correction—i.e. we internally corrected it among a bunch of people. We then reverse-engineer the timings back into the corrected transcript with some other stuff—Jim can speak to that. We’ve got a bit of a release coming up soon here; will tweet it at you when it’s ready, if you like.
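I don’t know the details of their implementation, but to illustrate what “reverse-engineering the timings back into the corrected transcript” might involve, here is a sketch that aligns a human-corrected transcript against the original ASR word list with Python’s `difflib`, carrying timestamps across for words that survived the correction. The data shapes here are my own assumptions, not any service’s actual output format.

```python
import difflib

def carry_timings(asr_words, corrected_text):
    """Reattach word timings from an ASR transcript to a corrected one.

    asr_words: list of (word, start_seconds) from the recognizer.
    Words unchanged by the correction keep their timestamp; words that
    were corrected or inserted get None (to be interpolated later).
    """
    corrected = corrected_text.split()
    matcher = difflib.SequenceMatcher(
        a=[w.lower() for w, _ in asr_words],
        b=[w.lower() for w in corrected],
    )
    timed = [None] * len(corrected)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            timed[block.b + k] = asr_words[block.a + k][1]
    return list(zip(corrected, timed))

asr = [("wreck", 0.0), ("a", 0.4), ("nice", 0.5), ("beach", 0.9)]
print(carry_timings(asr, "recognize speech beach"))
```

Gaps left as `None` could then be filled by interpolating between the nearest surviving timestamps.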

  3. Well done, and very good requirements analysis. Although it turns out the Google Cloud Speech API Beta can do batch mode on uploaded audio files, too bad it doesn’t provide timestamps.

  4. Excellent article! This is exactly what VOYZ.ES is aiming for, we are speaker diarizing and indexing Podcasts and interviews. We are a startup that tackles all the goals you mentioned (automated and crowdsourcing approaches) and employs almost all of the tech you are referring to.

    • Interesting. Some questions spring to mind… Any idea when it’ll be available for testing? What service(s) will you be using to perform the transcription? (I presume you’re not implementing your own.) Will you ‘unbundle’ the parts so it would be possible, for example, to upload a transcript produced by some other system to make use of just the crowdsourced corrections feature? If so, will timecodes be supported? How about confidence values that some ASR systems produce?

  5. Currently, we are in closed beta, and only accept a limited number of people for testing. We use all the top cloud ASR services and train our own speech models. The problem with cloud services, as you mentioned, is the lack of support for custom vocabulary (Microsoft stands out by providing better tools), language support, custom dialects, and so forth. NLP is also done with a combination of cloud and local services. Transcripts can be improved by your own contributions and via crowdsourcing. A full human transcript is possible at close to real time. Currently, you can import from some Web sources with CC automatically, but uploading and aligning pre-existing transcripts with media content is on the radar. Yes, we have time codes on speaker turns, with confidence scores down to word level.

    • Hello, and thank you, Rajiv!

      Those links are indeed very interesting. I’ve signed up for, and uploaded my test file to, SpokenData, Trint, and Deepgram. Spreza is still in closed beta, so I’ve registered my interest. I’m looking forward to including them all in my results. Thanks again!

      p.s. Regarding NowTranscribe, my understanding is that it uses Speechmatics as the backend but it’s not “from” Speechmatics.

  6. Tim (and all):

    Thanks for this overview. Developing a similar workflow/application in python/node.js, though not for podcasts. Would love to connect/collaborate.

    Wanted to add pyAudioAnalysis, which seems to be in active development and tackles some of the audio analysis needed, i.e. diarization. Haven’t tested it yet. Curious if others have had luck with this or other open source tools for diarization…

    • pyAudioAnalysis certainly looks interesting. Thanks Kate! I’ve not looked at diarization specifically yet. I’d also love to hear if others are having any luck.

  7. I’ve tested Trint with a single American Midwestern speaker with long-ish pauses, achieving near-perfect transcription. When used in combination with a human, I’ve seen Trint cut transcription time by ~80-90%.

    ScaleAPI is working on a human/machine transcription option.

    PopupArchive does automated transcription for keyword analysis and archiving.

    • Hello Michael.

      I agree that Trint are very good. In my testing they’re achieving slightly better accuracy than Speechmatics, the previous best.

      Sadly the ScaleAPI service is limited to 30 minutes. I’ve joined and uploaded a test file.

      I’ve also now joined PopupArchive and uploaded a test file. If their service works well then it looks like a good fit for my needs. Thanks for the suggestion!

        • Judging from the results it seems PopupArchive are using Speechmatics to provide their transcription, so that’s good.

          ScaleAPI boosting to an hour would be good, but two hours is typical for the podcast I’m most interested in, and some are longer than that.

          I’m still hoping to find someone, anyone, who does a really good job with speaker diarization!

        • The only perfect solution I’ve found there is using a service such as Zencastr to record single-speaker tracks and then merging the transcripts together using the timestamps. It costs 2× as much per episode, but I think it still comes in under the $0.70–1.00/min of human transcription.
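The merging step described above is mechanically simple once each speaker’s track has been transcribed separately: tag every segment with its speaker, pool them, and sort by start time. A minimal sketch, with hypothetical data structures (segment lists of `(start_seconds, text)` tuples; real services will differ):

```python
def merge_tracks(*tracks):
    """Merge per-speaker transcript segments into one diarized transcript.

    Each track is (speaker_name, segments), where segments is a list of
    (start_seconds, text) tuples from that speaker's solo recording.
    """
    merged = []
    for speaker, segments in tracks:
        for start, text in segments:
            merged.append((start, speaker, text))
    # Interleave the speakers by segment start time.
    merged.sort(key=lambda seg: seg[0])
    return merged

alice = ("Alice", [(0.0, "Welcome back to the show."), (12.4, "Let's dig in.")])
bob = ("Bob", [(5.1, "Thanks for having me.")])
for start, speaker, text in merge_tracks(alice, bob):
    print(f"[{start:6.1f}] {speaker}: {text}")
```

Overlapping speech is the hard part this glosses over: sorting by start time alone can’t represent two people talking at once.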

  8. Pingback: Comparing Transcriptions | Not this...

  9. Hi Tim.

    Great article. I’m CEO of Way With Words.

    We are in the transcription business as well, with our traditional transcription service for more custom transcript requests or larger group recordings.

    We recently launched a soon-to-be hybrid transcription solution built (at this time) on a different human transcriber model. We are in review for the next set of upgrades for corporate users from May 2017.

    Have a look and let me know if this fits with your considerations going forward.

  10. Pingback: Transkript für Logbuch Netzpolitik #232

  11. Pingback: The Advantages of Professional Video Transcription Services

  12. Hi Tim,

    Great analysis. But why don’t you just use ? It’s free and takes podcasts up to a couple of hours long, and gives you a pretty much instant transcription. There’s also a pretty nifty editor which means you can listen to your podcast, and make changes to the final transcript as you go. It’s been around for quite a while now, and is pretty accurate.

    • Hello Jules. Thanks for the suggestion. It has a 60MB file size limit, which is too small for my needs. I truncated my sample file to 60MB to try it out. The resulting transcript was truncated, with multiple out-of-memory errors. The initial portion of the transcript had a rather poor word error rate, similar to VoxSigma, SimonSays, and Watson. The best systems I’ve tested are roughly twice as good.
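For anyone curious how the word error rates quoted in these comments are defined: WER is the word-level edit distance (substitutions + insertions + deletions) between the ASR output and a reference transcript, divided by the reference length. A minimal sketch of the standard dynamic-programming computation (my actual scoring pipeline also normalizes punctuation and casing, which this omits):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, dropping one word from a six-word reference gives a WER of 1/6, roughly 16.7%.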

  13. “2018-12-10: Added Amazon Transcribe and a link to another review of tools.”
    Little mistake in the date?

  14. We at Scribie just released our new speech recognition models. Here are our automated transcripts for comparison.

    We do diarisation a bit differently from Speechmatics and AWS Transcribe, so the paragraph breaks are better.

    We are also planning to post WER/CER figures against LibriSpeech clean and other datasets soon. Will post the link here when we do.

    • Thanks for the update. I’ve just retried it and the service has improved. The WER for my test file has improved from 14.2 to 12.9.

      (I’ll also mention that every time I use the Scribie UI I find it very frustrating. It doesn’t convey a clear mental model of the current state or how to achieve tasks. Needs reworking from the ground up. Even clicking on those links fails for me with an error saying “Error: The page ‘/files/uploaded,%20’ does not exist”, but only if I’m logged in.)

      • Thanks for trying it out! And we are working on streamlining the UI. The process for getting an automated transcript is a bit messed up right now. Where did you get the link from, btw? I have been trying to track that issue down for some time. That link is wrong. Can you send me a screenshot?

  15. Pingback: A Comparison of Automatic Speech Recognition (ASR) Systems | Not this…

Comments are closed.