Language learning with Whisper
OpenAI, the Silicon Valley AI company co-founded by Elon Musk, released Whisper, an open-source automatic speech recognition (ASR) system that can transcribe audio in numerous languages. Whisper was trained on 680,000 hours of multilingual data, which gives it improved robustness to accents, background noise, and technical language.
In this blog post, I will use Whisper as a tool to help you practice your listening skills for language learning. The following sections will show you how to install Whisper and use it to transcribe audio files.
What is Automatic Speech Recognition (ASR)?
Automatic speech recognition (ASR) is the process of converting speech into text automatically.
Traditional ASR systems are composed of two main components: an acoustic model and a language model. The acoustic model is trained on audio data and has historically been based on hidden Markov models (HMMs), while the language model is trained on text data and can be either a statistical language model (SLM) or a neural network language model (NNLM).
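Very roughly, such a classical pipeline searches for the word sequence W that best explains the audio features X, something like W* = argmax_W P(X | W) · P(W), where the acoustic model scores P(X | W) and the language model scores P(W). Whisper takes a different, end-to-end route, as described below.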
Whisper's models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. Roughly 65% of the data (about 438,000 hours) is English audio with matching English transcripts, around 18% is non-English audio paired with English (translated) transcripts, and the remaining 17% (about 117,000 hours) is non-English audio with transcripts in the original language, covering 98 different languages.
Transformer model: Encoder-decoder
Unlike the traditional pipeline above, Whisper is a single end-to-end encoder-decoder Transformer: the audio is split into 30-second chunks, converted into log-Mel spectrograms, and passed to the encoder, while the decoder predicts the corresponding text tokens.
Installation
Dependencies
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
pip3 install setuptools-rust
Install Whisper from the GitHub repository
pip3 install git+https://github.com/openai/whisper.git
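To check that everything is installed correctly, you can try a minimal transcription from Python (the file name audio.mp3 below is just a placeholder for any audio file you have around):

import whisper

# downloads the model on first use, then loads it
model = whisper.load_model("base")

# transcribe an audio file; Whisper detects the language automatically
result = model.transcribe("audio.mp3")
print(result["text"])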
How about language learning?
One of the best ways to learn a language is by listening to audio recordings of native speakers. This will help you to hear the correct pronunciation and also to learn new words and expressions.
Additionally, audio recordings can help you better understand the flow and rhythm of the language. This can be especially helpful when you are starting out with a new language and are still working on the basics.
How do I use it?
I started learning Mandarin about three years ago, and I've been struggling with listening for quite a long time. For context, Mandarin has four tones, and a change in tone can completely alter the meaning of a word.
What I do to practice listening on my own is watch kids' TV shows. Some of them are quite "easy" to understand, and they are a nice way to pick up new words.
1. Find a video
Let's take this video as our example. The video has subtitles, and we will use them to check the model's accuracy.
2. Convert the video to MP3
Well, there are plenty of websites that convert a YouTube video to MP3...
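If you prefer to keep everything in Python, one option is the yt-dlp package. This is just a sketch of one possibility: it assumes yt-dlp is installed (pip3 install yt-dlp), that ffmpeg is available, and the URL is a placeholder you replace with the video you chose.

from yt_dlp import YoutubeDL

# download the best available audio track and convert it to mp3 with ffmpeg
options = {
    "format": "bestaudio/best",
    "outtmpl": "audio.%(ext)s",
    "postprocessors": [{
        "key": "FFmpegExtractAudio",
        "preferredcodec": "mp3",
    }],
}

with YoutubeDL(options) as ydl:
    # placeholder URL, replace with the video you want to practice with
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])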
3.Split the audio
Being able to replay a tough part of the audio makes it much easier to understand.
I split the audio into several chunks based on the silent parts:
from pydub import AudioSegment
from pydub.silence import split_on_silence

if __name__ == '__main__':
    # get audio file
    sound_file = AudioSegment.from_mp3("audio.mp3")

    # split into chunks on silences longer than 200 ms,
    # treating anything quieter than -40 dBFS as silence
    audio_chunks = split_on_silence(sound_file, min_silence_len=200, silence_thresh=-40)

    for i, chunk in enumerate(audio_chunks):
        out_file = "chunk{0}.mp3".format(i)
        # export audio
        chunk.export(out_file, format="mp3")
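If you want to replay a specific chunk while practicing, pydub can also play a segment directly. This is an optional sketch and needs an extra playback backend (for example simpleaudio, or ffplay from ffmpeg):

from pydub import AudioSegment
from pydub.playback import play

# load one of the exported chunks and replay it a few times
chunk = AudioSegment.from_mp3("chunk2.mp3")
for _ in range(3):
    play(chunk)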
4. Run the model on an audio chunk
import whisper

# load model
model = whisper.load_model("medium")

# get the text from the audio; I explicitly set the language to Mandarin ('zh')
result = model.transcribe("chunk2.mp3", language='zh')
print(result["text"])
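The result also contains per-segment timestamps, which are handy when you want to jump back to the exact spot you did not catch. A small sketch building on the result from the snippet above:

# each segment comes with start/end times in seconds
for segment in result["segments"]:
    print(f"{segment['start']:.1f}s - {segment['end']:.1f}s: {segment['text']}")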
5. Complete code
import whisper
from pydub import AudioSegment
from pydub.silence import split_on_silence
if __name__ == '__main__':
    # load model
    model = whisper.load_model("small")

    # get audio file
    sound_file = AudioSegment.from_mp3("audio.mp3")

    # split in chunks
    audio_chunks = split_on_silence(sound_file, min_silence_len=200, silence_thresh=-40)

    for i, chunk in enumerate(audio_chunks):
        out_file = "chunk{0}.mp3".format(i)
        print("exporting", out_file)

        # export audio
        chunk.export(out_file, format="mp3")

        # let's get the text from the audio
        result = model.transcribe(out_file, language='zh')
        print(result["text"])
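One small extension I find convenient (an optional idea, not part of the script above): save each transcription next to its chunk, so you can listen first and then check yourself against the text afterwards.

import glob
import whisper

model = whisper.load_model("small")

# transcribe every exported chunk and save the text next to it
for audio_path in sorted(glob.glob("chunk*.mp3")):
    result = model.transcribe(audio_path, language="zh")
    text_path = audio_path.replace(".mp3", ".txt")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write(result["text"])
    print(audio_path, "->", text_path)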
Accuracy
Let's compare a few audio clips. A ✓ means the transcription matches the original subtitles, and the time in parentheses is how long the transcription took:
Original text | base model | small model | medium model | large model |
---|---|---|---|---|
沒問題啦 | ✓沒問題啦 (1.27s) | ✓沒問題啦(3.63s) | ✓沒問題啦(10.22s) | ✓沒問題啦(16.63s) |
馬鈴薯要四顆,小黃瓜要兩條哦 | 馬鈴手要四顆,小黃瓜要兩條哦 (2.97s) | 馬鈴薯鴨四顆,小黃瓜要兩條哦!(8.89s) | ✓馬鈴薯要四顆,小黃瓜要兩條哦(24.51s) | ✓馬鈴薯要四顆,小黃瓜要兩條哦(41.00s) |
那我去買東西了路上小心哦 | ✓那我去買東西了路上小心哦 (5.86s) | ✓那我去買東西了路上小心哦!(17.25s) | 那我去買東西了路上小心喔(48.69s) | ✓那我去買東西了路上小心哦(81.89s) |
As expected, accuracy improves with larger models. Considering the trade-off between speed and accuracy, I think the medium model is enough.
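If you want to reproduce this kind of comparison on your own clips, here is a minimal sketch. The model names are the standard Whisper sizes; the timings will of course depend on your hardware, and the larger models take a while to download.

import time
import whisper

# compare output text and transcription time across model sizes
for size in ["base", "small", "medium", "large"]:
    model = whisper.load_model(size)
    start = time.perf_counter()
    result = model.transcribe("chunk2.mp3", language="zh")
    elapsed = time.perf_counter() - start
    print(f"{size}: {result['text']} ({elapsed:.2f}s)")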
Conclusion
That's it! This short blog post demonstrates how easy it is to install and use Whisper, and how remarkable its accuracy is. How about you? If you also use Whisper for language learning, tell me how you use it :)
Sources
Photo by Compare Fibre on Unsplash