WebRTC: how to apply WebRTC VAD to audio via samples obtained from a WAV file
I am currently parsing WAV files and storing the samples in a std::vector<int16_t> sample. Now I want to apply VAD (Voice Activity Detection) to this data to find the voiced "regions", or more precisely the beginning and end of each word.
The parsed WAV files are 16 kHz, 16-bit PCM, mono. My code is in C++.
I have searched a lot, but could not find proper documentation for the WebRTC VAD functions.
From what I have found, the function I need to use is WebRtcVad_Process(). Its prototype is:
int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame,
size_t frame_length)
From what I found in this StackOverflow question:
Each frame of audio you send to VAD must be 10, 20, or 30 milliseconds long. Here's a sketch of an example assuming the audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:
int is_voiced = WebRtcVad_Process (vad, 16000, audio_frame, 160);
It makes sense:
1 sample = 2 bytes = 16 bits
Sample rate = 16000 samples/sec = 16 samples/ms
For 10 ms, number of samples = 16 × 10 = 160 (320 bytes)
So, based on that, I implemented this:
const int16_t * temp = sample.data();
for(int i = 0, ms = 0; i < sample.size(); i += 160, ms++)
{
int isActive = WebRtcVad_Process(vad, 16000, temp, 160); //10 ms window
std::cout<<ms<<" ms : "<<isActive<<std::endl;
temp = temp + 160; // processed 160 samples
}
Now I'm not sure whether this is correct, or whether it gives me the right result.
So,
- Is it possible to use samples processed directly from wav files, or is some processing needed?
- Am I looking for the right function to do the job?
- How should I use the function so that VAD works properly on an audio stream?
- Is it possible to distinguish individual spoken words?
- What is the best way to check if the result I am getting is correct?
- If not, what is the best way to accomplish this task?
To begin with, no, I don't think you can segment an utterance into individual words using VAD. From the Wikipedia article on speech segmentation:
One would expect that the inter-word spaces used by many written languages, such as English or Spanish, would match pauses in their spoken form, but this is only true in very slow speech when the speaker deliberately inserts those pauses. In ordinary speech, one usually finds many consecutive words without pauses in between, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.
However, I will try to answer your other questions.
- Before starting VAD, you need to decode the WAV file, which may be compressed, into raw PCM audio data. See Reading and Processing Data of WAV Files in C/C++. Alternatively, you can use something like sox to convert the WAV file to raw audio before running your code. This command converts a WAV file of any format to 16 kHz, 16-bit, mono PCM in the format that WebRTC VAD expects:
sox my_file.wav -r 16000 -b 16 -c 1 -e signed-integer -B my_file.raw
- It looks like you are using the correct function. To be more specific, you should do this:
    #include "webrtc/common_audio/vad/include/webrtc_vad.h"
    // ...
    VadInst *vad;
    WebRtcVad_Create(&vad);
    WebRtcVad_Init(vad);
    const int16_t *temp = sample.data();
    // Stop before a trailing partial frame so the VAD never reads past the buffer.
    for (size_t i = 0, ms = 0; i + 160 <= sample.size(); i += 160, ms += 10)
    {
        int isActive = WebRtcVad_Process(vad, 16000, temp, 160); // 10 ms window
        std::cout << ms << " ms : " << isActive << std::endl;
        temp = temp + 160; // processed 160 samples (320 bytes)
    }
    WebRtcVad_Free(vad);
- To see if it works, you can run known files and see if you get the expected results. For example, you can start by processing silence and confirm that you never (or rarely; this algorithm is not perfect) see a voiced result come back from WebRtcVad_Process. Then you can try a file that is silence except for one short sentence in the middle, and so on. If you want to compare against an existing test, the py-webrtcvad module has a unit test that does this; see the test_process_file function.
- To perform word-level segmentation, you will probably need to find a speech recognition library that either does it or gives you access to the information you need to do it. For example, this thread on the Kaldi mailing list seems to discuss how to segment by words.