How to recognize when a user STARTS and STOPS talking in Android? (Voice recognition in Android)

I have done a lot of R&D and spent a lot of resources on this problem, but I have not managed to find the right solution.

I have developed an application, and now I want to add voice functionality to it.

Required functions:

1) when the user starts to speak, it should record audio/video, and

2) when the user stops talking, it should play the recorded audio/video back.

Note: here, "video" means whatever the user is doing in the app during that time period, for example button clicks, some kind of animation, etc.

I don't want to use the Google voice recognizer available by default in Android, because it requires an internet connection and my app works offline. I also looked into CMU Sphinx, but it is not useful for my requirements.

EDIT: I would also like to add that I already achieved this using Start and Stop buttons, but I don't want to use those buttons.

If anyone has any ideas or any suggestions please let me know.



3 answers

The simplest and most common method is to count the number of zero crossings in the sound (i.e. how often the sample sign flips between positive and negative).

If this value is too high, the sound is unlikely to be speech. If it is too low, then, again, it is unlikely to be speech.

Combine that with a simple energy level (how loud the sound is) and you've got a solution that's pretty robust.
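As a rough sketch of that combined approach, assuming frames of 16-bit PCM samples (the energy floor and the zero-crossing-rate band below are illustrative assumptions and would need tuning against real microphone data):

```java
// Minimal voice-activity check combining zero-crossing rate and energy.
// Thresholds are assumed values for illustration, not calibrated constants.
class SimpleVad {

    // Counts sign changes between consecutive 16-bit PCM samples.
    static int zeroCrossings(short[] frame) {
        int count = 0;
        for (int i = 1; i < frame.length; i++) {
            if ((frame[i - 1] >= 0) != (frame[i] >= 0)) count++;
        }
        return count;
    }

    // Mean absolute amplitude as a cheap energy measure.
    static double energy(short[] frame) {
        double sum = 0;
        for (short s : frame) sum += Math.abs(s);
        return sum / frame.length;
    }

    // A frame counts as "speech" if it is loud enough and its
    // zero-crossing rate is neither too low nor too high.
    static boolean isSpeech(short[] frame) {
        double zcRate = (double) zeroCrossings(frame) / frame.length;
        return energy(frame) > 500            // loudness floor (assumed)
            && zcRate > 0.01 && zcRate < 0.35; // speech-like ZCR band (assumed)
    }
}
```

You would feed this with consecutive buffers read from the microphone and start recording once a few frames in a row come back as speech.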

If you want a more accurate system, it gets much more complicated. One way is to extract audio features (MFCCs, for example) from training data, model them with something like a GMM, and then test the features extracted from live audio against the GMM. That way you can model the likelihood that a given audio frame is speech rather than non-speech. However, this is not an easy process.

I would highly recommend going down the zero-crossing route, as it's easy to implement and works great 99% of the time :)



You can try adding listeners for application events like navigation, clicks, animations, etc. In the listener implementations you can trigger the start/stop functions.
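As a minimal sketch of that idea in plain Java (the interface, class, and event names here are hypothetical; in a real Android app these would be wired to `View.OnClickListener` or `Animator.AnimatorListener` callbacks):

```java
// Hypothetical recorder control toggled by app events rather than
// dedicated Start/Stop buttons.
interface RecorderControl {
    void startRecording();
    void stopRecording();
}

class EventDrivenRecorder implements RecorderControl {
    boolean recording = false;

    public void startRecording() { recording = true; }
    public void stopRecording()  { recording = false; }

    // One central dispatch point: every UI listener forwards its event
    // here, and this decides whether to start or stop the recorder.
    void onAppEvent(String event) {
        if ("user_started_interacting".equals(event)) {
            startRecording();
        } else if ("user_stopped_interacting".equals(event)) {
            stopRecording();
        }
    }
}
```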

Take a look at these examples; they might be helpful for you.

But I am wondering: from what you described of your application's behavior, aren't you about to reinvent something like Talking Tom? :-P



Below is the code I am using in an iPhone application that does exactly the same thing. The code is in Objective-C++, but it has plenty of comments. It is executed inside a callback function on the recording queue. I'm sure a similar approach exists for the Android platform.

This approach works very well in almost every acoustic environment I have used it in, and it is what we use in our application. You can download the app to check it out if you like.

Try to implement it on the Android platform and you're done!

// If there are some audio samples in the audio buffer of the recording queue
if (inNumPackets > 0) {
        // The following 4 lines of code are vector functions that compute
        // the average power of the current audio samples
        // (see Apple's Accelerate/vDSP documentation for details).
        vDSP_vflt16((SInt16*)inBuffer->mAudioData, 1, aqr->currentFrameSamplesArray, 1, inNumPackets);
        vDSP_vabs(aqr->currentFrameSamplesArray, 1, aqr->currentFrameSamplesArray, 1, inNumPackets);
        vDSP_vsmul(aqr->currentFrameSamplesArray, 1, &aqr->divider, aqr->currentFrameSamplesArray, 1, inNumPackets);
        vDSP_sve(aqr->currentFrameSamplesArray, 1, &aqr->instantPower, inNumPackets);
        // instantPower holds the energy of the current audio samples
        aqr->instantPower /= (CGFloat)inNumPackets;
        // Add a small constant to instantPower to avoid +-infs and NaNs
        aqr->instantPower = log10f(aqr->instantPower + 0.001f);
        // instantAvgPower holds the energy over a bigger window
        // of time than instantPower
        aqr->instantAvgPower = aqr->instantAvgPower * 0.95f + 0.05f * aqr->instantPower;
        // avgPower holds the energy over an even bigger window
        // of time than instantAvgPower
        aqr->avgPower = aqr->avgPower * 0.97f + 0.03f * aqr->instantAvgPower;
        // This is the ratio that tells us when to start recording
        CGFloat ratio = aqr->avgPower / aqr->instantPower;
        // If we are not already writing to an audio file and
        // the ratio is bigger than a specific hardcoded threshold
        // (its value depends on the quality of the device's microphone;
        // I have set it to 1.5 for an iPhone), then start writing!
        if (!aqr->writeToFile && ratio > aqr->recordingThreshold) {
            aqr->writeToFile = YES;
        }
        if (aqr->writeToFile) {
            // Write the packets to the file
            XThrowIfError(AudioFileWritePackets(aqr->mRecordFile, FALSE, inBuffer->mAudioDataByteSize,
                                                inPacketDesc, aqr->mRecordPacket, &inNumPackets, inBuffer->mAudioData),
                          "AudioFileWritePackets failed");
            aqr->mRecordPacket += inNumPackets;
            // While recording, if instantAvgPower drops below avgPower,
            // increase the countToStopRecording counter...
            if (aqr->instantAvgPower < aqr->avgPower) {
                aqr->countToStopRecording++;
            }
            // ...otherwise reset it to 0.
            else {
                aqr->countToStopRecording = 0;
            }
            // If there was not enough power in 30 consecutive audio
            // buffers, OR we have recorded too much audio (the user
            // spoke for longer than a time threshold), stop recording
            if (aqr->countToStopRecording > 30 || aqr->mRecordPacket > kMaxAudioPacketsDuration) {
                aqr->countToStopRecording = 0;
                aqr->writeToFile = NO;
                // Notify the audio player that we finished recording
                // and it should start playing the audio!
                dispatch_async(dispatch_get_main_queue(), ^{
                    [[NSNotificationCenter defaultCenter] postNotificationName:@"RecordingEndedPlayNow" object:nil];
                });
            }
        }
}
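On Android, the same detector logic can be ported to plain Java and fed with PCM buffers read from `AudioRecord`. The sketch below mirrors the smoothing factors (0.95/0.05 and 0.97/0.03), the log-energy computation, the 1.5 ratio threshold, and the 30-silent-buffer stop rule of the snippet above; the class name, the normalization by 32768, and the maximum-buffer limit are my assumptions, not part of the original code:

```java
// Energy-ratio speech detector, a Java port of the iPhone snippet's logic.
// Feed it consecutive buffers of 16-bit PCM samples (e.g. from AudioRecord).
class EnergyRatioDetector {
    static final double THRESHOLD = 1.5;    // same value as the iPhone snippet
    static final int SILENCE_BUFFERS = 30;  // same stop rule as the snippet
    static final int MAX_BUFFERS = 1000;    // assumed max recording length

    double instantAvgPower = 0, avgPower = 0;
    boolean writing = false;
    int countToStop = 0, buffersWritten = 0;

    // Processes one audio buffer; returns true while speech is being recorded.
    boolean process(short[] buffer) {
        // Mean absolute amplitude, normalized to [0, 1], then log energy.
        double power = 0;
        for (short s : buffer) power += Math.abs(s) / 32768.0;
        power = Math.log10(power / buffer.length + 0.001); // avoid log(0)

        // Two exponential moving averages over increasingly long windows.
        instantAvgPower = instantAvgPower * 0.95 + 0.05 * power;
        avgPower = avgPower * 0.97 + 0.03 * instantAvgPower;

        // A loud buffer makes 'power' much less negative than the long-term
        // average, so this ratio jumps above the threshold on speech onset.
        double ratio = avgPower / power;
        if (!writing && ratio > THRESHOLD) writing = true;

        if (writing) {
            buffersWritten++; // here a real app would also write the buffer to a file
            if (instantAvgPower < avgPower) countToStop++;
            else countToStop = 0;
            if (countToStop > SILENCE_BUFFERS || buffersWritten > MAX_BUFFERS) {
                countToStop = 0;
                buffersWritten = 0;
                writing = false;
                // Here: notify the player that recording ended, start playback.
            }
        }
        return writing;
    }
}
```

In a recording loop you would call `process()` on each buffer returned by `AudioRecord.read()`, append the buffer to a file while it returns true, and trigger playback when it flips back to false.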




