Python Machine Learning With Speech Recognition

What is Speech Recognition?

Speech recognition refers to the process of recognizing and understanding spoken language. The input is audio data, and a speech recognizer processes this data to extract meaningful information from it. This has many practical uses, such as voice-controlled devices, transcription of spoken language into text, security systems, and so on.

Speech signals vary widely. There are many variations of speech, even within the same language, and many elements that shape it, such as language, emotion, tone, noise, and accent. It’s difficult to rigidly define a set of rules that constitutes speech. Despite all these variations, humans understand speech with relative ease. Ideally, we want machines to understand speech in the same way.

Reading and plotting audio data

Let’s take a look at how to read an audio file and visualize the signal. This is a good starting point, and it shows the basic structure of audio signals. Before we start, note that audio files are digitized versions of actual audio signals. Actual audio signals are complex, continuous-valued waves. To save a digital version, we sample the signal and convert it into numbers. For example, audio is commonly sampled at 44,100 Hz (the rate used for CD audio). This means that each second of the signal is broken down into 44,100 parts, and the value at each of these timestamps is stored. In other words, you store a value every 1/44,100 seconds. Because the sampling rate is high, the signal sounds continuous when we listen to it on our media players.
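As a rough illustration of the sampling idea, the following sketch evaluates a sine wave only at the sample instants and converts the result to 16-bit integers. The 440 Hz tone, the 0.01 second duration, and all variable names are assumptions chosen for the example; they do not come from the input file used in this tutorial.

```python
import numpy as np

sampling_freq = 44100   # samples per second (Hz)
duration = 0.01         # seconds of signal to generate (illustrative)
tone_freq = 440.0       # hypothetical test tone, in Hz

# Sample instants: one every 1/sampling_freq seconds
num_samples = int(round(sampling_freq * duration))
t = np.arange(num_samples) / float(sampling_freq)

# The continuous sine wave, evaluated only at the sample instants
samples = np.sin(2 * np.pi * tone_freq * t)

# Quantize to 16-bit signed integers, a common storage format for WAV files
pcm = np.round(samples * 32767).astype(np.int16)

print('Samples stored:', num_samples)
print('Spacing between samples (s):', t[1] - t[0])
```

Ten milliseconds of signal at this rate already takes 441 stored values, which is why even short recordings produce long arrays.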

How to do it

First, create a new Python file and import the following packages:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

We will use the wavfile module to read the audio from input_read.wav, the input file that is already provided to you:

# Read the input file
sampling_freq, audio = wavfile.read('input_read.wav')

Let’s print out the parameters of this signal:

# Print the params
print('\nShape:', audio.shape)
print('Datatype:', audio.dtype)
print('Duration:', round(audio.shape[0] / float(sampling_freq), 3), 'seconds')
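If you do not have the input file at hand, the same reading and printing steps can be tried on a synthetic recording. The sketch below is only an assumption-laden stand-in: it writes a 2-second, 440 Hz test tone to a temporary file named synthetic_tone.wav (a hypothetical name) with scipy.io.wavfile.write, then reads it back with wavfile.read.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

# Build a 2-second, 16-bit test tone (all values here are illustrative)
rate = 44100
t = np.arange(2 * rate) / float(rate)
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440.0 * t)).astype(np.int16)

# Write the tone to a temporary WAV file and read it back
path = os.path.join(tempfile.mkdtemp(), 'synthetic_tone.wav')
wavfile.write(path, rate, tone)
sampling_freq, audio = wavfile.read(path)

# Same parameters as printed for the real input file
print('Shape:', audio.shape)
print('Datatype:', audio.dtype)
print('Duration:', round(audio.shape[0] / float(sampling_freq), 3), 'seconds')
```

The duration is simply the number of samples divided by the sampling frequency, so 88,200 samples at 44,100 Hz give 2.0 seconds.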

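The audio signal is stored as 16-bit signed integer data, so each sample lies between -32768 and 32767. We normalize these values by dividing by 2^15, which scales them into the range [-1, 1):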

# Normalize the values
audio = audio / (2.**15)
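To see what this scaling does, here is a small sketch on a handful of hand-picked 16-bit values (the array raw is made up for illustration and is not taken from the input file):

```python
import numpy as np

# Hypothetical samples covering the full 16-bit signed range
raw = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)

# Dividing by 2**15 maps the integer range onto [-1, 1)
normalized = raw / (2.0 ** 15)

print('Datatype after scaling:', normalized.dtype)
print('Minimum:', normalized.min())
print('Maximum:', normalized.max())
```

The most negative value maps exactly to -1.0, while the largest positive value lands just below 1.0, because the 16-bit range is not symmetric around zero.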

Let’s extract the first 30 values to plot:

# Extract first 30 values for plotting
audio = audio[:30]

The X-axis is the time axis. Let’s build it by dividing each sample index by the sampling frequency, which gives the time of each sample in seconds:

# Build the time axis
x_values = np.arange(0, len(audio), 1) / float(sampling_freq)

Convert the units from seconds to milliseconds:

# Convert to milliseconds
x_values *= 1000
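A quick check of the time axis shows how short the plotted window is. This sketch assumes a sampling frequency of 44,100 Hz and the 30-sample slice used above; the variable names are illustrative.

```python
import numpy as np

sampling_freq = 44100   # assumed sampling frequency (Hz)
num_samples = 30        # length of the slice being plotted

# Time of each sample in seconds, then in milliseconds
x_seconds = np.arange(num_samples) / float(sampling_freq)
x_ms = x_seconds * 1000

print('First sample at (ms):', x_ms[0])
print('Last sample at (ms):', x_ms[-1])
```

Under these assumptions, the 30 samples span well under one millisecond, so the plot shows only a tiny fragment of the recording.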

Let’s plot this as follows:

# Plot the chopped audio signal
plt.plot(x_values, audio, color='black')
plt.xlabel('Time (ms)')
plt.ylabel('Amplitude')
plt.title('Audio signal')
# Display the figure window
plt.show()

The full code is in the file. If you run this code, you will see the following signal:

You will also see the following printed on your Terminal:

Shape: (142300,)
Datatype: int16
Duration: 4.0 seconds
Muhammad Mubeen

Mubeen is a full-stack web & mobile app developer who is very proficient in MEAN.js, Vue, Python, Ionic 4, Flutter, Firebase, ROR, and PHP. He has created multiple mobile and web applications. He is very passionate about sharing his knowledge.
