What is Speech Recognition?
Speech recognition refers to the process of recognizing and understanding spoken language. The input is audio data, and a speech recognizer processes this data to extract meaningful information from it. This has many practical uses, such as voice-controlled devices, transcription of spoken language into text, security systems, and so on.
Speech signals are highly variable. The same language can be spoken in many different ways, and a speech signal is shaped by many factors, such as language, emotion, tone, noise, accent, and so on. This makes it difficult to rigidly define a set of rules that constitutes speech. Despite all these variations, humans understand speech with relative ease, and we want machines to understand it just as robustly.
Reading and plotting audio data
Let’s take a look at how to read an audio file and visualize the signal. This is a good starting point because it shows the basic structure of audio signals. Before we start, we need to understand that audio files are digitized versions of actual audio signals. Actual audio signals are complex continuous-valued waves. To save a digital version, we sample the signal and convert it into numbers. For example, audio is commonly sampled at 44100 Hz. This means that each second of the signal is broken down into 44100 parts, and the values at these timestamps are stored. In other words, you store a value every 1/44100 seconds. Because the sampling rate is high, the signal sounds continuous when we listen to it on our media players.
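To make the idea of sampling concrete, here is a minimal sketch that samples a sine wave at 44100 Hz. The tone frequency and the duration are illustrative assumptions, not values taken from the input file used later:

```python
import numpy as np

sampling_freq = 44100   # samples per second (Hz)
duration = 0.01         # seconds of signal to generate (illustrative assumption)
tone_freq = 440.0       # frequency of the sine tone in Hz (illustrative assumption)

# One timestamp every 1/sampling_freq seconds
num_samples = int(round(sampling_freq * duration))
timestamps = np.arange(num_samples) / float(sampling_freq)

# Value of the continuous wave at each timestamp
samples = np.sin(2 * np.pi * tone_freq * timestamps)

print('Number of samples:', num_samples)
print('Spacing between samples (s):', timestamps[1] - timestamps[0])
```

The stored signal is just this array of values; the sampling frequency is what lets us map each index back to a point in time.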
How to do it
First, create a new Python file and import the following packages:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
We will use the wavfile module to read the audio from the provided input_read.wav file:
# Read the input file
sampling_freq, audio = wavfile.read('input_read.wav')
Let’s print out the parameters of this signal:
# Print the params
print('\nShape:', audio.shape)
print('Datatype:', audio.dtype)
# audio.shape is a tuple, so use its first entry (the number of samples)
print('Duration:', round(audio.shape[0] / float(sampling_freq), 3), 'seconds')
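The duration is the number of samples divided by the sampling frequency. The sketch below wraps that calculation in a helper; the name duration_seconds is hypothetical, and the zero-filled arrays stand in for real audio. It assumes the layout that wavfile.read returns, where the first axis indexes samples and a multi-channel file adds a second axis for channels:

```python
import numpy as np

def duration_seconds(audio, sampling_freq):
    """Hypothetical helper: signal length in seconds (axis 0 indexes samples)."""
    return audio.shape[0] / float(sampling_freq)

# Placeholder signals, not real recordings
mono = np.zeros(88200, dtype=np.int16)          # shape: (samples,)
stereo = np.zeros((88200, 2), dtype=np.int16)   # shape: (samples, channels)

print('Mono duration:', duration_seconds(mono, 44100), 'seconds')
print('Stereo duration:', duration_seconds(stereo, 44100), 'seconds')
```

Indexing shape[0] rather than using the whole shape tuple keeps the calculation valid for both mono and multi-channel input.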
The audio signal is stored as 16-bit signed integer data. We need to normalize these values:
# Normalize the values
audio = audio / (2.**15)
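A 16-bit signed integer ranges from -32768 to 32767, so dividing by 2^15 = 32768 maps the samples into the interval [-1, 1). As a sketch of the same idea for other signed integer widths, the hypothetical helper normalize_pcm below derives the scale factor from the array's dtype instead of hard-coding it; the sample values are made up for illustration:

```python
import numpy as np

def normalize_pcm(audio):
    """Hypothetical helper: scale signed-integer samples to floats in [-1.0, 1.0)."""
    if not np.issubdtype(audio.dtype, np.signedinteger):
        raise TypeError('expected signed integer samples')
    # Magnitude of the most negative representable value; 32768 for int16
    scale = float(-np.iinfo(audio.dtype).min)
    return audio / scale

# Illustrative values covering the extremes of the int16 range
samples = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
normalized = normalize_pcm(samples)
print(normalized)
```

The most negative sample maps exactly to -1.0, while the most positive one lands just below 1.0 because the integer range is not symmetric.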
Let’s extract the first 30 values to plot:
# Extract first 30 values for plotting
audio = audio[:30]
The X-axis is the time axis. Let’s build this axis by dividing each sample index by the sampling frequency, which gives the timestamps in seconds:
# Build the time axis
x_values = np.arange(0, len(audio), 1) / float(sampling_freq)
Convert the units from seconds to milliseconds:
# Convert to milliseconds
x_values *= 1000
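Multiplying by 1000 turns timestamps in seconds into milliseconds, which matches the 'Time (ms)' label used in the plot. The short sketch below shows the resulting range for 30 samples, assuming a sampling frequency of 44100 Hz (the actual value comes from the file that wavfile.read loads):

```python
import numpy as np

sampling_freq = 44100   # assumed here; wavfile.read reports the real value
num_points = 30         # number of samples kept for plotting

# Sample index -> seconds -> milliseconds
x_values_ms = 1000.0 * np.arange(num_points) / float(sampling_freq)

print('First timestamp (ms):', x_values_ms[0])
print('Last timestamp (ms):', x_values_ms[-1])
```

At this rate, 30 samples span well under a millisecond, which is why the plot shows only a tiny slice of the recording.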
Let’s plot this as follows:
# Plotting the chopped audio signal
plt.plot(x_values, audio, color='black')
plt.xlabel('Time (ms)')
plt.ylabel('Amplitude')
plt.title('Audio signal')
plt.show()
The full code is in the read_plot.py file. If you run this code, you will see the following signal:
You will also see the following printed on your Terminal:
Shape: (142300,)
Datatype: int16
Duration: 4.0 seconds