How to increase the accuracy of speech to text

With just a few small adjustments, speech to text can be up to 99% accurate. Read our tips & secrets for getting top results from AI transcription services.
August 1, 2018

It's hard to imagine a more disastrous scenario for someone transcribing audio or video: you've completed your interview and recorded excellent quotes for your content, but when it comes to converting the speech to text, you find that the quality of your recording is too poor to decipher. All that work amounts to, in the end, an unusable recording.

If you've felt the sting of this scenario, you're not alone - and it can be hard to shake. Even when you feel like you've come up with the perfect recording, you're never too far from that nagging feeling that something could've gone wrong with the audio - it's a burden that doesn't really lift until the time comes to transcribe your file.

The less people talking over each other, the better the accuracy

We've talked a lot about AI transcription and speech to text before. AI transcription saves time and boosts productivity, and features like multiple-speaker detection and built-in editing significantly improve the working lives of industry professionals. However, even the best automated services are not perfect, just as manual transcribers will inevitably make mistakes while transcribing audio and video content.

There are a few things that can negatively affect the accuracy of an automated transcript, including people talking at the same time (overtalk) and background noise during recording. The good news is that many of the problems we face in getting that perfect, crystal-clear audio-to-text conversion are easily preventable.

As the AI technology around speech to text services grows, so will the accuracy of transcription. Until the day comes when machines can generate transcripts with 100% accuracy, however, there are things you can do to improve your recording environment and boost automated transcription accuracy.

Recording environment for the best speech to text results

Before uploading your recording for Trint's powerful language algorithm to convert the speech to text, you're asked to tick a few boxes to confirm the quality of the file. This quality control is a checklist to ensure that the resulting transcription will be as accurate as possible.

First, users are asked whether the background is noise-free. This is possibly the biggest trap users can fall into - one that catches even the most competent professionals.

Crowded, noisy environments make for poor interview audio

A lot of environmental factors are out of your control. If you're conducting a journalistic interview, finding a good mutual location is difficult. Many professionals like the disarming appeal of a coffee shop or cafe to help the interviewee relax and feel comfortable enough to talk at length, but background noise is everywhere in public places like these.

If the dialogue you're recording can't take place in a private location, like a home or office, then make sure you find a place that's not likely to be busy. Google Maps' Popular Times feature predicts how busy public places are, and if you can meet somewhere at a time that isn't expected to be crowded, there's a good chance there won't be any pesky background noise to get in the way - or at least not as much.

It's also vital that you position the microphone in a way that's close enough to record the conversation with minimal interference. Encourage the interviewee to speak clearly into the device that you're using to record, and bear in mind that their words are the most valuable thing in this situation - the mic should be positioned wherever is best to record their voice first, and yours second. When it comes time to convert speech to text, distance from the microphone determines a lot.

In many other professions, you might not be lucky enough to find a quiet environment to record somebody's statements or thoughts. But don't worry: there are still ways to optimize your audio after recording their words to a file.

Luckily, software like Audacity is on hand to help should background noise cause difficulty. Audacity is a free, open-source program that specializes in editing audio files. Apply its Noise Reduction effect to clean up your recording and give it the best chance of an accurate, interruption-free transcription.
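Noise reduction of this kind typically works from a noise profile: you point the tool at a noise-only stretch of the recording, and it subtracts that noise's spectral fingerprint from the rest of the file. As a rough illustration of the idea - not Audacity's actual code, and the function and parameter names below are our own - here's a minimal spectral-subtraction sketch in Python using NumPy and SciPy:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(signal, noise_sample, rate, n_fft=1024):
    """Rough spectral subtraction: estimate the noise's average
    magnitude spectrum from a noise-only clip, then subtract it
    from every frame of the full recording."""
    # Average magnitude spectrum of the noise-only clip
    _, _, noise_spec = stft(noise_sample, fs=rate, nperseg=n_fft)
    noise_profile = np.abs(noise_spec).mean(axis=1, keepdims=True)

    # Subtract the profile from each frame's magnitude; keep the phase
    _, _, spec = stft(signal, fs=rate, nperseg=n_fft)
    magnitude = np.maximum(np.abs(spec) - noise_profile, 0.0)
    _, cleaned = istft(magnitude * np.exp(1j * np.angle(spec)),
                       fs=rate, nperseg=n_fft)
    return cleaned

# Demo on synthetic audio: a 440 Hz "voice" buried in hiss
rate = 16000
t = np.arange(rate * 2) / rate
voice = np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).normal(scale=0.3, size=t.size)
cleaned = spectral_subtract(voice + noise, noise, rate)
```

In practice you'd load your interview with something like `scipy.io.wavfile.read`, use a silent stretch as the noise sample, and write the cleaned signal back out - though for real recordings, Audacity's built-in effect will do a far better job than this sketch.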

One person at a time

Another great way to ensure your recordings are as clear as possible is to make sure there's little chance of overtalk.

Overtalk happens when two voices speak at the same time - often when an interviewee begins to answer a question before the interviewer finishes asking it. It becomes even more common as more voices are being recorded.

One-on-one interviews are clearer when subjects speak one at a time

The best way to combat this is to ask short, sharp questions and invite the interviewee to elaborate if you need a longer answer. If you're recording two or more people at once, it helps to address one person at a time, and if overtalk does happen, to follow up with a question directed at the interviewee you're most likely to quote.

Again, sometimes it's impossible to control the answers you receive, and an automated transcription service will inevitably get a little confused if it detects more than one voice at the same time. But thankfully the best speech to text services come with a timestamped editor that lets you listen back and transcribe by hand the passages where multiple voices enter the equation.

Finally, we come to another thing that makes converting speech to text difficult: strong regional dialects. If you're experienced in transcribing voice to text, you've surely dealt with deciphering different dialects and accents. How much time have you spent rewinding the same three-second passage over and over to figure out what word is being said?

As you might expect, there isn't a clear solution when it comes to overcoming the issues of recording individuals with strong accents.

Speakers with regional accents can be aided with the English - All Accents algorithm

A quirk of the human psyche is that we mimic others as a way of building subconscious rapport. When you're questioning someone and recording their answers, a great way to encourage them to soften their accent is to adopt a more formal tone yourself.

We at Trint, however, have an alternative approach for English transcription: allow us to introduce English - All Accents. This powerful feature increases the accuracy of AI transcription when there's more than one accent in an audio file, as well as with strong regional accents.

While automated transcription has a bright future ahead, sometimes the results are clouded by interference. With this walkthrough of some of the peskiest problems that transcription software faces, you can bring the future of seamless transcriptions a little bit closer.


Start your free trial of Trint today