Comparing two recorded voices

I need to find literature on how to compare a voice recorded in real time (from a microphone) against a database of pre-recorded voices. After the comparison, I would need to output a similarity percentage.

I have been researching audio fingerprinting, but I cannot find any literature on such an implementation. Is there an expert here who can help me achieve this?



3 answers


I've done this kind of work before, so I might be the right person to describe this procedure to you.

I had clean recordings of sounds that I considered the gold standards. I wrote Python scripts to convert these sounds into arrays of MFCC vectors. Read more about MFCCs here.

MFCC extraction can be seen as the first step in processing an audio file, i.e. extracting features that are good at characterizing the acoustic content. I created an MFCC vector for every 10 ms of audio, each with 39 attributes. So a 5-second sound file yielded about 500 MFCC vectors, each with 39 attributes.
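For illustration, a minimal sketch of that extraction step in Python is below. The original scripts are not shown in the answer, so the use of librosa (and its function names) is my assumption; the 10 ms hop and the 13 + 13 + 13 = 39 attributes per frame match the numbers quoted above.

    # Sketch of per-frame MFCC extraction (librosa is an assumed dependency).
    import numpy as np
    import librosa

    def extract_mfcc_features(path, n_mfcc=13, hop_ms=10):
        y, sr = librosa.load(path, sr=None)            # keep the file's native sample rate
        hop = int(sr * hop_ms / 1000)                  # one frame every 10 ms
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
        delta = librosa.feature.delta(mfcc)            # first-order differences
        delta2 = librosa.feature.delta(mfcc, order=2)  # second-order differences
        # 13 MFCCs + 13 deltas + 13 delta-deltas = 39 attributes per frame
        return np.vstack([mfcc, delta, delta2]).T      # shape: (n_frames, 39)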

Then I wrote artificial neural network code along these lines. More about neural networks can be read here.

I then trained the weights and biases of the neural network, commonly known as the network parameters, using stochastic gradient descent, with the gradients computed by backpropagation. The trained model was then saved and used to identify unknown sounds.

New sounds were then represented as sequences of MFCC vectors and given as input to the neural network. The network predicts, for each MFCC instance obtained from the new sound file, one of the sound classes it was trained on. The number of correctly classified MFCC instances gives the accuracy with which the neural network could classify an unknown sound.
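A minimal sketch of the training and per-frame prediction steps is below. The answer describes hand-written network code; scikit-learn's MLPClassifier (which also trains with stochastic gradient descent and backpropagation) stands in for it here, so the library and hyperparameters are my assumptions.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_frame_classifier(train_frames, frame_labels):
        # train_frames: (n_frames, 39) MFCC vectors pooled from all training files
        # frame_labels: one class label per frame, e.g. 'whistle', 'siren', ...
        clf = MLPClassifier(hidden_layer_sizes=(64,), solver="sgd",
                            learning_rate_init=0.01, max_iter=500)
        clf.fit(train_frames, frame_labels)
        return clf

    def classify_frames(clf, new_frames):
        # Predict a class for every MFCC frame of an unknown sound and tally the votes.
        preds = clf.predict(new_frames)
        classes, counts = np.unique(preds, return_counts=True)
        return dict(zip(classes, counts.tolist()))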



Consider, for example, that you train your neural network on 4 types of sounds using the above procedure: 1. whistle, 2. car horn, 3. dog bark, and 4. siren.

A new 5-second recording contains a siren. You will get approximately 500 MFCC instances from it. The trained neural network will try to classify each MFCC instance into one of the classes it has learned, so you might end up with something like this:

30 instances were classified as whistle, 20 were classified as car horn, 10 were classified as dog bark, and the remaining 440 were correctly classified as siren.

The classification accuracy, or rather the similarity between the sounds, can be roughly calculated as the ratio of correctly classified instances to the total number of instances, which in this case is 440/500, i.e. 88%. This field is relatively new, and a lot of prior work has used similar machine learning techniques such as Hidden Markov Models, Support Vector Machines, and more.
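In code, that score is just the fraction of frames assigned to the expected class; using the hypothetical counts from the worked example above:

    votes = {"whistle": 30, "car horn": 20, "dog bark": 10, "siren": 440}
    similarity = 100.0 * votes["siren"] / sum(votes.values())   # 440 / 500 = 88.0 %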

This problem has been solved before, and you can find research papers about it on Google Scholar.



I'm no expert in this area (so take this accordingly), but you should look at the following:

How to approach?



  1. filter the voices

    The minimum band needed for recognizable speech is roughly 0.4-3.4 kHz (which is why that band was used in old telephone filters). The human voice usually contains frequencies up to about 12.7 kHz, so if your recordings are unfiltered, low-pass them at 12.7 kHz, and also filter out the 50 Hz or 60 Hz hum from power lines.

  2. Make a dataset

    If you have recordings of the same sentence to compare, you can simply compute the spectrum using a DFT (FFT) or DCT of the same tone/letter (e.g. at the start, middle, and end). Filter out the unused areas and build a voice-print dataset from the data. If not, you first need to find similar tones/letters in the recordings; for that you either need speech recognition to be sure, or you need to find parts of the recordings with similar properties. What to look at (by trial and error or by studying speech-recognition papers), here are some clues: tempo, dynamic volume range, frequency ranges.

  3. compare the datasets

    The numerical comparison can be done with the correlation coefficient, which is pretty simple (and my favorite); you could also use a neural network for this (even for point 2), and there may be some fuzzy approach as well. I recommend the correlation coefficient because its output is close to what you want and it is deterministic, so there is no problem with over-/under-training, wrong architecture, etc. A rough sketch of points 1-3 follows this list.
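Below is a rough sketch of points 1-3 for the simple case where both recordings contain the same sentence at the same sample rate. The scipy/numpy tools, the notch Q, and the whole-file spectrum (rather than per-tone spectra) are my simplifications; only the 12.7 kHz and 50/60 Hz figures come from the points above.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, iirnotch, filtfilt

    def load_mono(path):
        sr, x = wavfile.read(path)
        x = x.astype(float)
        if x.ndim > 1:                                 # average stereo channels
            x = x.mean(axis=1)
        return sr, x

    def clean_voice(x, sr, mains_hz=50.0):
        # Low-pass to the quoted 12.7 kHz voice limit and notch out mains hum.
        cutoff = min(12700.0, 0.45 * sr)               # stay below Nyquist for low sample rates
        b, a = butter(4, cutoff, btype="low", fs=sr)
        x = filtfilt(b, a, x)
        bn, an = iirnotch(mains_hz, Q=30.0, fs=sr)     # use mains_hz=60.0 for 60 Hz grids
        return filtfilt(bn, an, x)

    def voice_print(x, sr, n_bins=2048):
        # Magnitude spectrum of the cleaned recording as a simple voice print,
        # resampled onto a fixed number of bins so prints can be compared.
        spec = np.abs(np.fft.rfft(clean_voice(x, sr)))
        grid = np.linspace(0.0, 1.0, n_bins)
        spec = np.interp(grid, np.linspace(0.0, 1.0, len(spec)), spec)
        return spec / (np.linalg.norm(spec) + 1e-12)

    def similarity_percent(file_a, file_b):
        # Correlation coefficient of the two voice prints, reported as a percentage.
        sr_a, a = load_mono(file_a)
        sr_b, b = load_mono(file_b)
        r = np.corrcoef(voice_print(a, sr_a), voice_print(b, sr_b))[0, 1]
        return max(0.0, r) * 100.0

Aligning the same tone/letter in both recordings (point 2) would make this far more meaningful than comparing whole-file spectra.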

[edit1]

People also use formant filters to generate vowels and speech. Their properties mimic the human vocal tract, and the mathematics behind them can also be used in speech recognition: by checking the characteristic frequencies of such a filter you can detect vowels, intonation, tempo, ... which can be used directly for speaker detection. However, this is far outside my area of expertise; there are many articles on this subject, so just google ...



This is definitely not a trivial problem.

If you're seriously trying to solve it, I suggest you take a close look at how speech coders work.

A rough breakdown of the steps:

  1. Determine the intervals in the recording that contain vowels
  2. Determine the fundamental frequency and harmonics of each vowel sound
  3. Determine the relative amplitudes of the harmonics and the average frequency of the fundamental
  4. Develop a “distance” metric that measures how close two vowel sounds are to each other, based on the parameters from step 3
  5. Calculate the distance from the vowel sounds of the new recording to those of the recordings in the database

The parameters from step 3 are sort of a fingerprint of the vocal tract. As a rule, consonant sounds are not distinctive enough to matter (unless the vowel sounds of two persons are very similar).
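A hypothetical sketch of the step-4 distance metric is below. It assumes each vowel has already been reduced to a mean fundamental frequency f0 (in Hz) and a vector of relative harmonic amplitudes (fundamental normalised to 1.0); that parameterisation and the weights are my choices, not values from the answer.

    import numpy as np

    def vowel_distance(a, b, w_f0=1.0, w_harm=1.0):
        # a, b: dicts with keys 'f0' (Hz) and 'harmonics' (relative amplitudes).
        d_f0 = abs(np.log2(a["f0"] / b["f0"]))          # pitch difference in octaves
        n = min(len(a["harmonics"]), len(b["harmonics"]))
        h1 = np.asarray(a["harmonics"][:n], dtype=float)
        h2 = np.asarray(b["harmonics"][:n], dtype=float)
        d_harm = np.linalg.norm(h1 - h2) / max(n, 1)    # spectral-envelope mismatch
        return w_f0 * d_f0 + w_harm * d_harm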

As a first and very simple step, try to identify the average fundamental frequency of the vowels and use that frequency as your signature.
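A rough sketch of that simple first step is below: estimate the median fundamental frequency over high-energy frames (a crude stand-in for the vowel intervals) using plain autocorrelation. The frame length, energy threshold and 75-400 Hz pitch range are my assumptions.

    import numpy as np

    def frame_f0(frame, sr, fmin=75.0, fmax=400.0):
        # Autocorrelation pitch estimate for a single voiced frame.
        frame = frame - frame.mean()
        corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / fmax), min(int(sr / fmin), len(corr) - 1)
        lag = lo + int(np.argmax(corr[lo:hi]))
        return sr / lag

    def median_f0(signal, sr, frame_ms=40):
        # Median F0 over frames whose energy suggests voiced speech.
        n = int(sr * frame_ms / 1000)
        frames = [signal[i:i + n].astype(float) for i in range(0, len(signal) - n + 1, n)]
        energy = np.array([np.sum(f ** 2) for f in frames])
        voiced = [f for f, e in zip(frames, energy) if e > 0.5 * energy.mean()]
        return float(np.median([frame_f0(f, sr) for f in voiced])) if voiced else None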

Good luck,

Jens
