Speaker Gender Detection from Sound Waveform Data

I would like to add gender detection to a news video broadcasting application I am working on, so that it can switch between male and female voices to match the voice of the speaker on screen. I don't expect 100% accuracy. I used EZAudio to get the waveform data for a stretch of audio and used its average RMS to set the cutoff value between male and female. Initially cutOff = 3.3.

    - (void)setInitialVoiceGenderDetectionParameters:(NSArray *)arrayAudioDetails
    {
        // Average RMS value over a stretch of audio, say 5 seconds
        float initialMaleAvg = ((ConvertedTextDetails *)[arrayAudioDetails firstObject]).audioAverageRMS;
        // MaleVector shifts the threshold to suit different news clips
        float initialMaleVector = initialMaleAvg * 80;
        // Chain the two checks so the second no longer overwrites the first
        if (initialMaleVector < 5.3)
            cutOff = initialMaleVector;
        else if (initialMaleVector > 23)
            cutOff = initialMaleVector / 2;
        else
            cutOff = 5.3;
    }

      

Initially adjustValue = -0.9 and tanCutOff = 0.45. The values 5.3 and 23, along with cutOff, adjustValue and tanCutOff, come from rigorous testing. The tan function is applied to pRMS to widen the differences between values.

    - (BOOL)checkGenderWithPeekRMS:(float)pRMS andAverageRMS:(float)aRMS
    {
        // pRMS is the peak RMS value in the audio snippet and aRMS is the average RMS value
        BOOL male = NO;
        if (tan(pRMS) < tanCutOff)
        {
            if (pRMS / aRMS > cutOff)
            {
                cutOff = cutOff + adjustValue;
                NSLog(@"FEMALE....");
                male = NO;
            }
            else
            {
                NSLog(@"MALE....");
                male = YES;
                cutOff = cutOff - adjustValue;
            }
        }
        else
        {
            NSLog(@"FEMALE.");
            male = NO;
        }

        return male;
    }

      

The adjustValue parameter recalibrates the threshold every time a news video is translated, since each video has a different noise level. But I know that this method is noob-ish. What can I do to get a stable threshold? Or how can I normalize each audio clip?

Alternative or more efficient ways to determine gender from audio waveforms are also welcome.

Edit: Following Nikolai's suggestion, I explored gender recognition using CMU Sphinx. Can anyone suggest how I can extract the MFCC features and feed them into a GMM / SVM classifier using OpenEars (CMU Sphinx for the iOS platform)?

+3




2 answers


Accurate gender identification can be achieved with a GMM classifier trained on MFCC features. You can read about it here:

AGE AND GENDER RECOGNITION FOR TELEPHONE APPLICATIONS BASED ON GMM SUPERVECTORS AND SUPPORT VECTOR MACHINES



To date, I am not aware of an open-source implementation, although many of the components are available in open-source speech recognition toolkits such as CMUSphinx.

+1




Accurate gender identification can be achieved by training GMM classifiers on MFCC features of male and female speech. Here's how to do it.

  • Collect a training set of audio recordings for each gender.
  • Extract MFCC features from all the audio of each gender (Python implementations are available, e.g. scikit-talkbox).
  • Train one GMM per gender on the features extracted from that gender's training set.
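
The training and scoring steps above can be sketched roughly as follows. This is only an illustration in pure Python, not the code from the linked tutorial: `DiagGMM`, `classify`, `synthetic_frames` and `demo` are hypothetical names, the GMM uses diagonal covariances, and the "features" are made-up 2-D points standing in for real MFCC vectors. In practice you would probably swap `DiagGMM` for a library model such as scikit-learn's `GaussianMixture` and feed it real MFCC frames.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


class DiagGMM:
    """Small diagonal-covariance GMM fitted with EM (illustrative, not optimized)."""

    def __init__(self, n_components=2, n_iter=25, seed=0):
        self.k = n_components
        self.n_iter = n_iter
        self.rng = random.Random(seed)

    def _frame_logs(self, frame):
        # Per-component log(weight * density) for a single frame.
        return [
            math.log(w) + log_gauss_diag(frame, m, v)
            for w, m, v in zip(self.weights, self.means, self.variances)
        ]

    def fit(self, frames):
        n, dim = len(frames), len(frames[0])
        # Start from k random frames as means and the global variance.
        self.means = [list(f) for f in self.rng.sample(frames, self.k)]
        centre = [sum(f[d] for f in frames) / n for d in range(dim)]
        spread = [
            sum((f[d] - centre[d]) ** 2 for f in frames) / n + 1e-3
            for d in range(dim)
        ]
        self.variances = [list(spread) for _ in range(self.k)]
        self.weights = [1.0 / self.k] * self.k
        for _ in range(self.n_iter):
            # E-step: responsibility of each component for each frame.
            resp = []
            for f in frames:
                logs = self._frame_logs(f)
                norm = logsumexp(logs)
                resp.append([math.exp(v - norm) for v in logs])
            # M-step: re-estimate weights, means, and variances.
            for j in range(self.k):
                nj = max(sum(r[j] for r in resp), 1e-8)
                self.weights[j] = nj / n
                self.means[j] = [
                    sum(r[j] * f[d] for r, f in zip(resp, frames)) / nj
                    for d in range(dim)
                ]
                self.variances[j] = [
                    sum(r[j] * (f[d] - self.means[j][d]) ** 2
                        for r, f in zip(resp, frames)) / nj + 1e-3
                    for d in range(dim)
                ]
        return self

    def score(self, frames):
        """Average per-frame log-likelihood of the frames under this model."""
        return sum(logsumexp(self._frame_logs(f)) for f in frames) / len(frames)


def classify(frames, male_gmm, female_gmm):
    """Label a clip by whichever gender model explains its frames better."""
    return "male" if male_gmm.score(frames) > female_gmm.score(frames) else "female"


def synthetic_frames(centre, n, rng):
    """Made-up 2-D 'feature' frames; real input would be MFCC vectors."""
    return [[rng.gauss(c, 1.0) for c in centre] for _ in range(n)]


def demo():
    """Train one model per gender on synthetic data, then label two new clips."""
    rng = random.Random(42)
    male_gmm = DiagGMM().fit(synthetic_frames((-2.0, 0.0), 200, rng))
    female_gmm = DiagGMM().fit(synthetic_frames((2.0, 1.0), 200, rng))
    male_clip = synthetic_frames((-2.0, 0.0), 50, rng)
    female_clip = synthetic_frames((2.0, 1.0), 50, rng)
    return (classify(male_clip, male_gmm, female_gmm),
            classify(female_clip, male_gmm, female_gmm))
```

Calling `demo()` trains the two models and labels one held-out clip per gender. The decision compares average log-likelihoods rather than a hand-tuned cutoff, so it does not depend on a per-video threshold.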


See the open-source Python implementation below for details. The tutorial evaluates the code on a subset extracted from Google's AudioSet, which was released this year (2017).

https://appliedmachinelearning.wordpress.com/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/

-1

