Ever thought about speech to text transcription? Got a pile of voice recordings that you’re trying to convert to text?
Well, read on!
googleLanguageR is a simple but powerful R package that uses the Google Cloud Speech engine to convert spoken word to text, and vice versa.
Amazing!
Although the text-to-speech facility is really interesting, and could be used in any application that aims to synthesise human speech, I'll limit today's discussion to the speech-to-text application.
As an aside, I explored several other solutions, including paying a transcription company (too expensive), using dictation software (error-prone and adds a middleman) and transcribing manually (no time stamping). As a data scientist who loves writing in R, I was also looking for something that leveraged my mad skills in the area (read: something simple to code).
Why do you need to read this?
- Explains how to easily transcribe voice to text
- Uses plain language to navigate Google-eze
- Describes the various specialist configurations available
How to start coding?
The package googleLanguageR is available via CRAN. Install it with the following code:
install.packages("googleLanguageR") #download and install the package
library(googleLanguageR) #load the package
You also need to apply for a Google Speech API key. This is not made clear in the accompanying R documentation (it took me a while to figure out why my code wasn't running!). This link will take you to a sign-in page, where you can apply for an API token.
Approval takes only a minute or two, and the API key can be downloaded as a JSON file. This is quite important, because loading the JSON file is needed before running any googleLanguageR code. The application process lets you name the JSON file anything you like; below I've named it 'speech'. I'm also using the 'here' package, which I encourage everyone to use. I'll be writing more about the amazing 'here' package very soon.
Save the JSON file to a folder immediately under your R script or project folder, so 'here' can find it easily. gl_auth() points googleLanguageR at the JSON key file. This is basically the licence that allows you to use the Google Speech translation engine.
gl_auth(here("speech.json")) #Authenticate with the downloaded JSON key file.
Once you've installed the package and loaded the JSON key, you're ready to go!
The Code
We start by building a configuration list. It contains all the specialist options that Google Speech makes available.
my_config <- list(
  diarizationConfig = list(
    enableSpeakerDiarization = TRUE, #tag each speaker in the recording
    minSpeakerCount = 2, #the minimum no. of speakers
    maxSpeakerCount = 3), #the maximum no. of speakers; up to 6 are allowed
  enableWordTimeOffsets = TRUE, #adds start and end times for each transcribed word
  model = "phone_call", #other specialised models, e.g. "video", are available, but only in certain languages
  profanityFilter = FALSE, #can be useful when transcribing for public use
  enableAutomaticPunctuation = FALSE #likewise useful when transcribing for public use
)
Here is a full list of specialist transcription models available for each supported language. US English and French seem to offer the most customisability.
Next comes the main section of code. It pulls in the configuration list in the final line.
test <- here("speech.wav") #point to your sound file

translate <- gl_speech(
  test, #your local .wav file
  encoding = "MULAW", #tells the package which encoding the file uses. Possible formats include .wav (MULAW and LINEAR16), .flac and .mp3
  sampleRateHertz = 8000, #plug in the sampling rate; defaults to 16000
  languageCode = "en-US", #this is where you choose your language engine
  asynch = TRUE, #asynchronous transcribing, TRUE/FALSE; see below for more info
  customConfig = my_config #refers to the configuration list above
)
Here is a full list of file formats supported. Generally, .wav, .flac and .mp3 are supported; only certain codecs of .wav work, so it's worth checking the link to ensure you have the right file. Google, being Google, is quite specific about file types, and the wrong one can make or break your code!
The asynch argument determines whether transcribing is synchronous or asynchronous. Synchronous transcribing happens as the code runs, and either mode is available for recordings under one minute.
Asynchronous transcribing is the only option for recordings longer than 60 seconds: the file is sent to the cloud and transcribed remotely, so it takes several seconds before the result is available.
If you run the next statement every 5-10 seconds, Google Speech will tell you when it's finished; the length of the recording influences how long this takes.
result <- gl_speech_op(translate) #checks on the job; while the engine is still running, it spits back status information like this:

> 2021-07-20 08:54:24 -- - started at 2021-07-19T22:54:19.881859Z - last update: 2021-07-19T22:54:21.576319Z
The following message is delivered once Google Speech has concluded.
> 2021-07-18 21:00:59 -- Asynchronous transcription finished.
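If you'd rather not re-run the statement by hand, a small loop can do the polling for you. This is only a sketch: it assumes that, while the job is still running, the object returned by gl_speech_op() has no $transcript element, and that $transcript appears once transcription has finished.

```r
#Poll the asynchronous job every 10 seconds until a transcript appears.
#Assumes 'translate' holds the job returned by gl_speech(..., asynch = TRUE).
result <- gl_speech_op(translate)
while (is.null(result$transcript)) {
  Sys.sleep(10)                      #wait before asking again
  result <- gl_speech_op(translate)  #re-check the job
}
```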
The next code statement reveals the transcription, buried at the second level of the returned object.
result$transcript$transcript #spits back the transcript

[1] "mary had a little lamb"
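Because the configuration above switched on enableWordTimeOffsets and speaker diarization, the result carries more than the plain transcript. In my runs the extra detail sits in the timings element alongside transcript; the column names below are what I've seen returned, so treat this as a sketch:

```r
#Per-word timing detail, populated because enableWordTimeOffsets = TRUE;
#speaker tags also appear per word when diarization is enabled:
str(result$timings)

#e.g. pull out each word with its start time:
result$timings[, c("startTime", "word")]
```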
And that’s it!
Important things to note about speech to text transcription
- Google speech-to-text transcription only accepts recordings under 10 MB, and Google Speech will spit out an error message if you attempt anything over this limit. This can pose problems for longer recordings; the workaround is to split longer sound files and rejoin the transcripts afterwards.
- The enableWordConfidence option is only available to organisations approved for pre-release testing. If you're one of the lucky few, enableWordConfidence will compute a confidence score for each word transcribed, which is useful in determining speaker clarity over time.
- Be aware that real-world recordings made in noisy environments will degrade transcription accuracy. For example, googleLanguageR has transcribed some telephone calls I'm researching with an accuracy of only 55-70%. Google Speech won't transcribe muffled words very well, or might even mistranscribe them.
- Play around with the language engine. For instance, the recordings I'm researching are in Australian English, yet the US English engine does a far better job of transcribing them than the AU English engine!
- In theory the phone_call and other specialised models compensate for background noise; in practice the improvement is not noticeable.
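The splitting workaround mentioned in the first point above can be sketched with the tuneR package. This is an assumption on my part: any audio package that reads and writes .wav files will do, and the file names here are purely illustrative.

```r
library(tuneR) #assumed installed, for basic .wav handling
library(here)

#Split a long recording into 55-second chunks that each stay under
#Google's limits; each chunk can then be sent to gl_speech() in turn
#and the resulting transcripts pasted together in order.
long <- readWave(here("long_interview.wav"))       #illustrative file name
chunk_secs <- 55
samples_per_chunk <- chunk_secs * long@samp.rate
n_chunks <- ceiling(length(long@left) / samples_per_chunk)

for (i in seq_len(n_chunks)) {
  from <- (i - 1) * samples_per_chunk + 1
  to   <- min(i * samples_per_chunk, length(long@left))
  piece <- extractWave(long, from = from, to = to, xunit = "samples")
  writeWave(piece, here(sprintf("chunk_%02d.wav", i)))
}
```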
Conclusion
This amazing little R package essentially translates your R code into JSON requests, leveraging the power of the Google Speech engine. It will perform admirably where the recording is of good quality and free of background noise; be aware, though, that background sound or poor voice clarity will degrade transcription quality.
Google speech-to-text transcription can easily feed into other text analysis workflows, such as word disambiguation and topic clustering. But more on that next time!