We, the modern people, are tickled with our phones’ voice-recognition powers. We can ask questions! We can open apps with our voice!
You know who’s not so tickled? Anyone who records other people talking. Our phones are terrible at transcribing voices—converting them to text that we can edit.
I realize that most people don’t care about transcribing audio. But if you’re a reporter, producer, editor, author, YouTuber, filmmaker, student, documentarian, researcher, government agency, doctor, lawyer, or police officer, for example, you might really care. Manual transcription of audio and video files is an excruciating, tedious, soul-sucking exercise, and we’ve been doing it pretty much the same way for 50 years.
The world waits for a method that’s fast, cheap, and accurate.
Until now, there have been only a few ways to convert a recording into text:
This is a review of a new, fifth approach: Trint.com. (The name, we’re told, is a combo of “transcript” and “interview.”) It lands on a new point in that speed-cost-accuracy continuum by (a) automating the conversion instead of hiring humans, and (b) providing a slick, easy way for you to breeze through the results and correct the errors.
“The idea is to take the very best of automated speech recognition [ASR] and push it as far as it will go, then give the user a simple tool to get those last yards,” Trint founder Jeffrey Kofman told me. “By combining a text editor with an audio/video player, we let you quickly search, verify, and correct the output of our ASR.”
The cost is $15 per hour of video—about a quarter of the cost of even the cheapest human web-based services.
You sign up. You upload your audio or video file. You pay in advance: $15 for an hour of converted audio or video. (If you’re willing to commit to doing a lot of this, the cost comes down to $12 an hour.)
You wait maybe five minutes—an insanely short time—and then it’s done. You open the transcription right there in your browser. Already, what you get is good enough that you can search for words or highlight the good parts.
But while the system correctly detects sentences and adds periods, it adds no other punctuation. It makes no attempt to add commas, for example. So you wind up with phrases like “I went to you know the store and bought peaches plums and pickles.” No question marks, either.
Now you read through it, correcting the errors, adding punctuation and paragraph breaks, and identifying speaker names using a pop-up menu. The video above shows what this process is like.
What’s kind of wonderful is how the audio or video playback is integrated with the editing: Wherever you click your mouse, that bit plays back automatically. There’s no Play, Pause, Rewind cycle here; the system always knows what to play when. (You can turn this playback on or off with a keystroke, and also control the playback speed.)
So how long does this cleanup process take? I tried Trint on seven interviews, and the editing generally wound up equaling the length of the interview. Thirty-minute interview, 30 minutes to clean it up.
That will never fly in the professional world. CBS News won’t be using Trint any time soon. But doing the job yourself would take five to ten times the length of the original recording. And if you hired a professional service, you’d pay four to eight times as much.
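If you like to see the math, here is a rough back-of-the-envelope sketch using only the figures quoted above: $15 per hour of audio, cleanup time about equal to the recording length, do-it-yourself transcription at five to ten times the length, and a professional service at four to eight times the price. The function name and the pro-rated billing are my own assumptions for illustration; Trint may well round or bill differently.

```python
def trint_estimate(minutes, rate_per_hour=15.0):
    """Rough cost and effort for one recording, using the review's figures.

    Assumption: the hourly rate is pro-rated by the minute, which the
    review does not confirm.
    """
    cost = (minutes / 60) * rate_per_hour
    return {
        "trint_cost": cost,                           # dollars
        "trint_cleanup_minutes": minutes,             # editing took about 1x the length
        "manual_minutes": (minutes * 5, minutes * 10),  # typing it yourself: 5x to 10x
        "pro_service_cost": (cost * 4, cost * 8),     # human service: 4x to 8x the price
    }


estimate = trint_estimate(30)
for label, value in estimate.items():
    print(f"{label}: {value}")
```

Pass `rate_per_hour=12.0` to see the numbers at the bulk rate.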
(For many people, of course, a full cleanup isn’t necessary; often, the point of transcribing an interview is just to skim it to find the good parts. That’s where Trint really shines. It’s simple to read or search for text in a transcript, highlight the juicy parts, and even play back only the highlighted portions. In that case, you can clean up only those few bits.)
To compare the results, I submitted the same interview recording to Trint, to Rev.com ($1 a minute for 24-hour turnaround), and to a high-end pro service. Here’s what I got back:
There are some bugs left to squash in Trint. For example, copy and paste don’t work in the text editor. (I was editing an interview that contained the word GISHWHES over and over again, a non-word that Trint never once transcribed correctly. I thought I could just paste it in over and over again, but no joy.) The company explains that if you went nuts, pasting in blobs of text, you’d throw off the software’s underlying links between the audio and the text.
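To picture why pasting could be a problem, here is a minimal sketch of how a transcript editor might link text to audio. This is not Trint's actual code or data format; the `Word` class, the function names, and the timestamps are all hypothetical. The idea is simply that each word carries the moment it was spoken, so a click on any character can be mapped back to a playback position.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds into the recording (made-up values below)


def build_offsets(words):
    """Character offset where each word begins in the space-joined transcript."""
    offsets, pos = [], 0
    for w in words:
        offsets.append(pos)
        pos += len(w.text) + 1  # +1 for the separating space
    return offsets


def time_at(words, offsets, char_pos):
    """Start time of the word under the clicked character position."""
    i = bisect_right(offsets, char_pos) - 1
    return words[max(i, 0)].start


words = [Word("I", 0.0), Word("went", 0.4), Word("to", 0.9),
         Word("the", 1.1), Word("store", 1.3)]
offsets = build_offsets(words)

# Clicking inside "to" (character 8 of "I went to the store")
print(time_at(words, offsets, 8))
```

In a scheme like this, a pasted blob of text would arrive with no timestamps of its own, so the editor would have nothing to play when you clicked on it.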
Chrome extensions can trip up Trint, too. Every time I inserted a Return to break up a paragraph, some text would disappear. Turning off all my extensions fixed that.
You should also keep in mind that Trint requires clean, clear audio, in which your subject was miked. You can’t feed it the echoey recording of your kid’s school play, for example, and expect decent results. And, as you’d guess, thick accents dramatically impair the accuracy. (When you post the recording, you specify which accent the speaker has; that helps.)
The company acknowledges that it has some work to do, and says that it has big plans for Trint 2.0 this summer.
It’s therefore a welcome new weapon in the fight against the costly, time-consuming, soul-sucking act of transcribing the human voice.