Audio and video producers have long sought the holy grail of a quick, accurate and affordable automated transcription service. Computer transcription holds the promise of not only making the production process faster and easier but also making audio and video programs more discoverable online by rendering their spoken content searchable.
Pop Up Archive has been offering its transcription service since 2014, with the backing of several nonprofits. Trint, a competing service, debuted in 2016 and has quickly gained ground. I’ve used both, and in my experience Trint is much more accurate, though still hardly perfect. (In a well-mic’d conversation between two clearly enunciating broadcast professionals, Trint transcribed the word “Trint” variously as “it,” “trend,” “print,” “Trent,” “trance” and “tranche.”)
Trint is web-based: you upload your audio to Trint.com, which processes it for a few minutes and then serves you a time-coded transcription inside the web app, from which you can email files or export them to your local drive. Where Trint really shines is the audio player that is seamlessly integrated into its text editor, which lets you hear what you see in real time and make edits accordingly.
The big catch: cost. Trint currently costs $12-15 per hour of audio, depending on which monthly plan you buy — the same basic pricing structure as Pop Up Archive’s. Since cost is always top-of-mind for people in public media, I asked Trint founder Jeff Kofman if he could ever drop the price and go for more of a volume-based business model.
Kofman, it turns out, is a former public media journalist himself. He spoke to me from Trint’s office in London, and I edited our conversation below (which was excerpted on my podcast The Pub) for clarity and brevity.
Kofman: I spent 30 years as a broadcast journalist, starting in Canada [at the CBC in Toronto] and then at CBS Evening News and ABC News in the U.S. and over here in London. I figure I spent thousands of hours of my life transcribing my own interviews, news conferences, lectures, etc., and as a journalist, I think you really need to know your content. I’m a big believer in that. But that doesn’t mean you need to be a stenographer stuck in a dark room with headphones on for two days in this kind of purgatory because you let your interviews run long.
A couple of years ago, while I was researching some university courses I was teaching, I met this brilliant team of developers who’d done an experiment with transcription, tethering the text to the audio, kind of like karaoke. They were using manual transcription, but I was so impressed with what they’d done, and I said, “Well wait a minute, Siri and all those fancy speech-to-text services, aren’t they now good enough that you could try to use them on this technology?”
I remember Laurian [Gridinoc], who became our senior developer, saying to me with his ironic smile, “It’s an interesting idea, we could try it.” And I just felt like I was seeing the future. So we got together and we started building Trint in December 2014. We were a team of four. It took us several efforts to get something that actually did what we hoped it would do, and we released it in September of 2016.
People ask me, “Do you miss your old job as a journalist, as a foreign correspondent, as a war correspondent?” I don’t have time. We’re riding a rocket, and it’s really hard, it’s exhausting, but it’s really exhilarating too.
Current: As it exists now, I think most people use Trint as a tool that allows them to transcribe their raw tape, to wrap their brain around it, and to find the clips they want to pull. Where do you see the product going? Is that its final form, or is it going to be more of a publication platform?
Kofman: Trint really is about making the transcription workflow process faster, easier and more affordable. Jobs that would take you several hours transcribing a couple of hours of interviews — we can get you those hours back in a number of minutes. You can search keywords, you can find them, verify them. There’s some work on your part, depending on audio quality, but you can get at your content fast and you can get on with writing your story.
When we look ahead 12, 18 months, I think Trint is only realizing maybe 20 percent of its potential. We see Trint evolving beyond transcription into an end-to-end publishing platform, taking raw recorded content — whether it’s audio or video — putting it into Trint, getting it back very quickly, cleaning it, polishing it, and putting it out there, whether it’s on social media, whether it’s making it SEO-friendly on your website. Then [we’ll be] adding all sorts of feature sets that allow you to do a lot of really interesting things: to move the text and the audio and/or video around the screen, to create new products, new documents.
When I started in broadcasting as a TV reporter, we just did two minutes a night. That was our job. By the time I left — and every journalist knows this — I was doing social media on Facebook and on Twitter, I was doing podcasting, I was doing an online story, I was doing radio. It doesn’t matter where you work today, everybody’s expected to do that, and Trint is building a platform for the multimedia work.
Current: The people reading this are in public broadcasting — most of them at local stations, some at national networks — but the common thread among them is that they’re poor. The number-one piece of feedback that they asked me to pass on to you is that they love Trint, but it’s just too expensive for them, so they only use it in emergencies. A lot of people say — and I’m one of them — that they would use Trint for every project if it was cheaper, and they’d probably end up paying you more money in the long run as a result. What do you say to that?
Kofman: I respect that perspective. The first thing I would say is we have real costs. This isn’t like Photoshop, where you download it onto your computer, and whether you use it for an hour or 100 hours in a month, it doesn’t cost Adobe any more. We have real computing costs associated with each minute, each hour [of audio transcribed]. So, we’re not yet in a position where we can say, “Pay a monthly membership, and you can do all-you-can-eat.” Obviously, that’s where a lot of users would like us to get to, and we may be able to get there, we see some routes to that. The technology is still in development. Speech science is — in some levels — a mature science, but in other levels it’s a frontier.
The funny thing about this comment about it being too expensive is a lot of people say to us “By the way, I’d pay twice for this service. You guys are really undercharging.” I’m sensitive and respectful of both perspectives. I think that as Trint evolves — and remember, it’s only been to market a couple of months — you’ll see us start to offer tiered levels of services. If you consider what’s out now as “Trint Basic,” we’ll start to add feature sets: “Trint Bronze, Trint Silver, Trint Gold,” or whatever we decide to call them. The objective would be to keep “Trint Basic” as affordable as possible so that people like you and the people you’re talking about don’t see price as a barrier to entry.
You know, Trint is a word we invented for “transcription interview.” Our users use it as a verb and a noun. I want it to be in the dictionary, and the way that’s going to happen is if users like you use us every day, and they just say “Let’s Trint it. Where’s the Trint? Get it Trinted and get on with it.” If we can make it more affordable and more indispensable, then it’s a win for everyone.
Current: Do you think it’s always going to need to be a web-based service, for technological reasons or for business reasons? Or could it be a downloadable plugin that you could use in the Adobe Suite, for example? (My note: The real holy grail for me is getting accurate speech-to-text integrated with editing software.)
Kofman: I think we will be able to liberate it from the web, in some form, in the next two years. The issue is going to be whether we can offer the same degree of accuracy. Would you sacrifice a small percentage of accuracy for having it untethered and for having it unlimited? That’s the research we’re doing now.
Current: Is that because the databases on which Trint relies to do its analysis are so huge that users couldn’t store them on their local drives?
Kofman: Massive, massive, yeah. We’re talking about cloud computing on a very large scale to get the kind of accuracy that we give back. And I would challenge anyone to find a more accurate transcript on a consistent level. Where we’re not good is with grammar. We only have periods, we don’t have question marks or commas. I think question marks will come at some point in the next year or two. Commas are really, really difficult to write an algorithm around.
Current: Most humans screw up their commas, too.
Kofman: Yeah, not to mention semicolons. The other area we’re not great with yet is “diarization,” which is differentiating speakers. We know the answers. It’s a lot of work for us to build it, but we actually have solutions in hand, we just need to be able to build them, and I think that’s going to take us a year.