Fri, September 23, 2022, Ralf Hersel
Sometimes I get really excited. That happened last night, when news of a free speech-to-text engine flew through my timeline. We know the company OpenAI as a leader in practical applications of language processing and “artificial art”. You may have tried DALL·E to turn your thoughts into paintings. The San Francisco company is even better known for its GPT-3 language model.
All of OpenAI’s previous products have been commercial services that impress but were not available to everyone and required a great deal of expertise to use in everyday life. That changed last night. The company has released its new speech recognition model, Whisper, under the MIT license. And I tried it right away.
What does it do? It is a neural network (don’t call it AI) that converts speech (in the form of an audio file) into written text. What do you need that for? For example, to turn meetings into written minutes, or to transcribe podcasts.
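Whisper comes in five model sizes, given here with their number of parameters:
- Tiny (tiny): 39 million
- Base (base): 74 million
- Small (small): 244 million
- Medium (medium): 769 million
- Large (large): 1550 million
The four smaller sizes are also available as English-only variants (tiny.en, base.en, small.en, medium.en); the large model is multilingual only.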
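If you want to pick a model size in a script, the sizes can be kept in a small lookup table. The following is only a sketch and not part of Whisper: the parameter counts (in millions) are the ones listed in the Whisper README, and the helper name largest_model_within is made up for this illustration.

```python
from typing import Optional

# Parameter counts in millions, as listed in the Whisper README.
MODEL_PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def largest_model_within(max_params_millions: int) -> Optional[str]:
    """Hypothetical helper: name of the largest model that fits the budget."""
    candidates = [
        name
        for name, params in MODEL_PARAMS_MILLIONS.items()
        if params <= max_params_millions
    ]
    if not candidates:
        return None  # even the tiny model exceeds the budget
    return max(candidates, key=MODEL_PARAMS_MILLIONS.get)


print(largest_model_within(800))  # prints "medium"
```

The returned name is the value you would pass to the --model option of the whisper command.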
Until now, there have been some cloud services for this task, but many people would rather not use them, either because of the cost or because they do not want to hand over their data. The good news is that Whisper is a) free software, b) very easy to use, and c) gives convincing results. If you are interested in the details of the system, you can read them in the source or at Heise. I’m interested in whether it works.
Until now, you could build an STT engine yourself from services, models and instructions with PyTorch, HuggingFace Transformers and a lot of specialist knowledge. But that was hardly feasible for the moderately interested user. Now the tide has turned, and that’s a good thing.
First, check whether ffmpeg is installed on your system, which is usually the case. Just type ffmpeg in the terminal: you will either see its output or, if it is missing, be prompted to install it.
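The same check can be scripted. The following is a minimal sketch in Python that uses only the standard library; the function name ffmpeg_available is made up for this illustration.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on the PATH."""
    # shutil.which returns the full path of the executable, or None if it is missing.
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    print("ffmpeg found:", shutil.which("ffmpeg"))
else:
    print("ffmpeg is missing; install it with your distribution's package manager")
```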
Then create a subdirectory with any name, e.g. whisper. Now navigate to this directory in the terminal with cd whisper and run this command:
pip install git+https://github.com/openai/whisper.git
If you don’t have the Python installer pip, you can install it from your distribution’s software store. That’s it already; nothing else needs to be installed. What you need now is an English-language audio file. For my experiment, I extracted just under 1 minute from the last episode of the LateNightLinux podcast. You should listen to this clip so that you can compare it with the transcription.
In the next step, this audio file is converted into text. To do this, place the audio file (or any other English-language audio file) under the name latenightlinux.mp3 in the directory you created earlier and run this simple command:
whisper latenightlinux.mp3 --model medium
The command is self-explanatory: whisper is applied to the file latenightlinux.mp3 using the medium model (769 million parameters). Now you need patience. Depending on the performance of your computer, it will take about 15 minutes for the transcript to be created.
While whisper does its job, you can follow the transcription in 30-second steps in the terminal. When it has finished, the directory also contains the text file latenightlinux.mp3.txt:
But to distract ourselves from this, this is horribly sad day. Let’s talk about our discoveries. Will, what is Navidrome? Previously on discoveries you would have learned how I got my audiobook out of audible and turned into an mp3 on m4a or whatever it is that you want, which is all well and good but I do not want to carry something like. I don’t know seven or eight gigs worth of audio. It’s quite a lengthy tome around on my phone like about my phone storage is very expensive. Cloud storage is very cheap and my data plan on my phone is also very cheap and so what I want to do is store it in the cloud and stream it to my phone like you would do with a Spotify or one of those sorts of things.
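The same transcription can also be started from Python instead of the command line. The sketch below follows the usage shown in the Whisper README (whisper.load_model and model.transcribe). The helper names format_timestamp, format_segment and transcribe_to_file are made up for this illustration, the timestamp layout only approximates what the command-line tool prints, and the contents of the written file may differ in detail from the file the whisper command creates.

```python
from pathlib import Path


def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segment(segment: dict) -> str:
    """Render one transcribed segment as '[start --> end] text'."""
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    return f"[{start} --> {end}] {segment['text'].strip()}"


def transcribe_to_file(audio_path: str, model_name: str = "medium") -> Path:
    """Transcribe an audio file and write the text next to it as <name>.txt."""
    # Imported here so the helpers above also work where Whisper is not installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads the model on first use
    result = model.transcribe(audio_path)

    # Each segment is a dict with (among others) "start", "end" and "text".
    for segment in result["segments"]:
        print(format_segment(segment))

    out_path = Path(audio_path + ".txt")
    out_path.write_text(result["text"].strip() + "\n", encoding="utf-8")
    return out_path
```

Calling transcribe_to_file("latenightlinux.mp3") in the directory from the steps above would print the segments and write latenightlinux.mp3.txt.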
Now you can put on your headphones and compare the audio file with the transcribed text. I think you’ll come to the same conclusion as I did: it’s very good. Now think about how you can use Whisper for your own purposes.