How can I count the number of words spoken, using any method (speech recognition or otherwise)?

I'm having trouble finding pointers on how to accomplish what seems like a deceptively easy task:

Given an audio stream, how do you count, in real time, the number of words that have been spoken?

I don't need to know what the words are; I just need a reasonable count of how many have been spoken. The count does not need to be very precise, and it may even include non-word utterances such as grunts or coughs.

It seems that all speech recognition systems depend on a predefined grammar, which must be supplied before they can map the spoken phonemes to known words with some degree of accuracy. I don't care about recognition accuracy, though; what I care about is the rate at which words are being spoken.

The important thing is that this happens in real time, so that the system can raise an alert after a certain number of words have been spoken. The system will display a visual cue telling the speaker to pause, and then the speaker can continue.

I looked through the CMU Sphinx FAQ and found that "word spotting" is not yet supported. I don't really need real-time spotting of specific words, but it comes close to what I'm looking for. Looking for very short silences in the waveform seems like a crude way to do this and is probably not very accurate, but that's all I have for now.
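To make the silence-based idea concrete, here is a minimal sketch of what I have in mind. Everything in it is an assumption on my part rather than an established method: the names `frame_rms` and `BurstCounter` are made up, frames are assumed to be short lists of float samples in the range -1.0 to 1.0, and the threshold and frame counts are placeholder values that would need tuning. It counts bursts of sound separated by short silences, which probably tracks syllables or phrases more closely than actual words.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class BurstCounter:
    """Counts bursts of sound separated by short silences, frame by frame.

    All default values are guesses for illustration, not tuned settings.
    """

    def __init__(self, threshold=0.05, min_silence_frames=3,
                 min_burst_frames=2, alert_every=None):
        self.threshold = threshold                    # RMS level treated as "sound"
        self.min_silence_frames = min_silence_frames  # quiet frames that end a burst
        self.min_burst_frames = min_burst_frames      # loud frames needed to start one
        self.alert_every = alert_every                # alert after this many bursts
        self.count = 0
        self._loud_run = 0
        self._quiet_run = 0
        self._in_burst = False

    def feed(self, frame):
        """Process one frame; return True when the alert limit was just reached."""
        if frame_rms(frame) >= self.threshold:
            self._loud_run += 1
            self._quiet_run = 0
            if not self._in_burst and self._loud_run >= self.min_burst_frames:
                # Enough consecutive loud frames: a new burst has started.
                self._in_burst = True
                self.count += 1
                if self.alert_every and self.count % self.alert_every == 0:
                    return True
        else:
            self._quiet_run += 1
            # Outside a burst, any quiet frame discards a too-short click.
            # Inside a burst, only a long enough silence ends it.
            if not self._in_burst or self._quiet_run >= self.min_silence_frames:
                self._in_burst = False
                self._loud_run = 0
        return False
```

In use, each incoming frame from the audio stream would be passed to `feed`, and a `True` return value would trigger the visual pause cue. A fixed RMS threshold is the weakest part of this sketch, since background noise or a quiet speaker would throw the count off.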

Any pointers to algorithms, research papers, or any other ideas would be appreciated!
