Wired: Transcribing Audio Sucks—So Make the Machines Do It

January 8, 2018

AN UNPRECEDENTED VOICE-TRANSCRIPTION technology can tell you not only what’s being said, but who is saying it.

The web app, named Trint, can listen to an audio recording or a video of two or more speakers (or just one) engaged in natural speech, then provide a written transcript of what was said. Unlike Siri or Google Talk, Trint is designed to transcribe long blocks of text.

Trint’s technology is still nascent, but it could eventually give new life to vast swaths of non-text-based media on the internet, like videos and podcasts, by making them readable to both humans and search engines. People could read podcasts they lack the time or ability to listen to. YouTube videos could be indexed with a time-coded transcript, then searched for text terms. There are other applications too: Filmmakers could index their footage for better organization, and journalists, researchers, and lawyers could save the many hours it takes to transcribe long interviews.

As machine learning and automation technologies continue to transform the 21st century (and especially journalism), voice transcription remains a pesky speed bump. Transcription in particular is a technology that some have spent decades pursuing and others deemed outright impossible in our lifetimes. While news organizations and social media outlets alike have invested heavily in video content, the ability to optimize those clips for search engines remains elusive. And with younger readers still preferring print to video anyway, the value of transcribed text remains high.

Look Who’s Talking

Based in London and launched in autumn 2016, Trint is a web app built on two separate but entwined elements. The company’s transcription algorithm feeds text into a browser interface for editing, which links the words in a transcript directly to the corresponding points in the recording. While the accuracy is hardly perfect (as Trint’s founders are the first to admit), the system almost always produces a transcript that’s clean enough for searching and editing. At roughly 25 cents per minute (or $15 per hour), Trint’s software-as-a-service costs a quarter of the $1 per minute rate offered by competitors. There’s a reason Trint is so cheap: Those other services, like Casting Words and 3Play, use humans to clean up automated transcripts or to do the actual transcribing. Trint is all machines.
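
To make the idea of a time-coded transcript concrete, here is a minimal Python sketch of how words might be tied to points in a recording and then searched. It is an illustration only, not Trint's actual data model: the `TimedWord` class, the `find_term` helper, and the timings in `SAMPLE` are all invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedWord:
    """One transcribed word plus its position in the recording (hypothetical structure)."""
    text: str     # the word as transcribed
    start: float  # seconds from the start of the recording
    end: float    # seconds at which the word finishes


def find_term(words: List[TimedWord], term: str) -> List[float]:
    """Return the start time of every word matching `term`.

    Matching ignores case and punctuation at the edges of a word.
    """
    needle = term.lower()
    return [w.start for w in words if w.text.lower().strip(".,!?\"'") == needle]


# Invented timings for a short quote; a real transcript would come from the recognizer.
SAMPLE = [
    TimedWord("I've", 0.0, 0.4),
    TimedWord("lived", 0.4, 0.8),
    TimedWord("this", 0.8, 1.0),
    TimedWord("problem", 1.0, 1.6),
    TimedWord("and", 1.6, 1.8),
    TimedWord("hated", 1.8, 2.3),
    TimedWord("transcribing,", 2.3, 3.1),
]

if __name__ == "__main__":
    # Jump points (in seconds) where the search term is spoken.
    print(find_term(SAMPLE, "transcribing"))
```

Because every word carries its own timestamps, a text search doubles as a seek operation: an editor can take the returned start time and move the audio player straight to that moment.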

Microsoft has released voice recognition toolkits for programmers to experiment with, and Google just last week added multi-voice recognition to its Google Home smart speaker. But Trint’s software was the first public-facing commercial product to serve this space.

According to lead engineer Simon Turvey, Trint users report an error rate of between five and 10 percent for cleanly recorded audio. Though this is close to the eight percent industry standard estimated last year by veteran Microsoft scientist Xuedong Huang, the Trint founders consider their product’s editing function to be the feature that gives them a stronger competitive edge. Trint’s time-coded transcript and web-based editor allow users to quickly find and work on the quotes they need.
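
The article does not say how these error rates were measured. A common yardstick in speech recognition is word error rate: the number of word substitutions, deletions, and insertions needed to turn the machine transcript into a correct one, divided by the number of words in the correct one. The sketch below assumes that metric; the function name and the sample sentences are made up for illustration.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    correct = "the quick brown fox jumps over the lazy dog today"
    machine = "the quick brown fox jumps over a lazy dog today"
    # One wrong word out of ten reference words.
    print(word_error_rate(correct, machine))
```

By this measure, a transcript with a 10 percent error rate gets roughly one word in ten wrong, which suggests why an editor that makes corrections quick matters as much as the raw accuracy.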

Are You Getting This?

Putting on headphones and logging mind-numbing hours of transcription work has long been a thankless grunt job (and rite of initiation) at major news organizations. Anybody who has ever transcribed has almost surely wished for even a partial solution. Trint’s team knows this pain.

“I’ve lived this problem and hated transcribing,” says Trint CEO and co-founder Jeff Kofman, a veteran journalist with a career spanning three decades. “I figure I’ve spent thousands of hours of my life transcribing interviews and speeches and news conferences, searching through tapes.”

Gerald Friedland, director of the audio and multimedia lab at the International Computer Science Institute, was roundly impressed by the interface. Using a standard test from ICSI’s Meeting Recorder Project, Friedland pushed Trint by feeding it a conversation recorded with an iPhone containing “overlap, laughter, emotions, and other things that occur in a meeting.” Then he observed how accurately Trint maps the fragments of everyday human speech, if not all of the nuances. “Overall, I think it has a state-of-the-art performance with a nice interface,” he says. “The issue is [that] a nice interface is really still needed as the transcription has to be heavily edited after the recognition.”

Though not yet able to fully differentiate between voices, and gumming up especially when two subjects speak at once, Trint did well while transcribing an interview I recorded in a noisy midtown Manhattan cafe. Another interview recording, conducted in a rock club with a band intermittently soundchecking in the background, likewise produced a workable transcript.

A truer battle test for Trint, however, was provided by two of the most identifiable and infamously unintelligible voices of the past half-century: Keith Richards and the late Hunter S. Thompson. With whole chapters of Fear & Loathing in Las Vegas made up of transcribed field recordings of Thompson’s 1971 Vegas rampage, it was left to a poor (but heroic) editorial assistant to turn the tapes into printed words. With Trint, Thompson might have found somebody else to terrorize.

Trint’s transcription of a 1993 television conversation between Thompson and Keith Richards, however, read like somebody far more smashed than even Thompson. The software did surprisingly well with Richards’ grizzled British accent, but the late gonzo journalist’s half-mumbled word clusters seemed to break Trint’s brain entirely. The app simply skipped the offending phrases when Thompson talked too fast, as he did most of the time.

Read That Back to Me

Trint can currently understand 13 languages, including several varieties of English accents. Since it’s a cloud-based application, Trint’s voice transcription algorithm can be updated frequently to add new languages, new accents (Cuban-accented English is tough), and fresh batches of proper nouns.

But just as Trint might provide a solution to journalists, it could equally provide a technological home for isolated voice research projects without commercial outlets. Trint is partnering with one researcher to address diarization, the accurate identification of individual speakers. Turvey and his team also eventually see Trint being used by conferences, which can provide transcripts of all the presentations once an event is over. But the primary market, says Turvey, is still media organizations looking to make several generations of their unlogged video and audio instantly searchable.

“If we can deliver accurate raw text,” he says, “then that becomes a platform upon which the semantic analysis, the sentiment analysis, the topic extraction—all of those things—can then be layered.”

Though Trint’s founders are fond of deploying their brand’s name as both a noun and a verb, many journalists—or anybody who’s spent any time transcribing—might simply find it synonymous with “amazing.”

Originally published in Wired.
