Tesseract OCR: recognize only dictionary words

I am using the Tesseract OCR plugin for PhoneGap: https://github.com/jcesarmobile/PhonegapOCRPlugin/i

I am trying to set up Tesseract to recognize only dictionary words, that is: no special characters, no suffixes or prefixes, etc.

Since the tessdata folder from this project does not contain any configs, I thought I would set the variables right after init. Right now I am trying to set them by modifying claseAuxiliar.mm, but I have not noticed any difference. That could be because the values are wrong, or because I am setting them incorrectly. Below are my settings and how I am trying to set them:

    // init the tesseract engine.
    tesseract = new tesseract::TessBaseAPI();
    tesseract->Init([dataPath cStringUsingEncoding:NSUTF8StringEncoding], "eng");
    if (!tesseract->SetVariable("segment_penalty_dict_nonword","10"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("segment_penalty_garbage","10"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("stopper_nondict_certainty_base","-100"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("language_model_penalty_non_dict_word","1"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("language_model_penalty_non_freq_dict_word","1"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("GARBAGE_STRING","5"))
        printf("Setting variable failed!!!\n");
    if (!tesseract->SetVariable("NON_WERD","5"))
        printf("Setting variable failed!!!\n");

      



1 answer


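You can try suppressing the system dictionary and loading an alternative custom dictionary (a plain word list) instead. The Tesseract man page describes this in its section on config files, using the parameters load_system_dawg, load_freq_dawg and user_words_suffix:

https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc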
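
Since the tessdata folder in the question has no configs, here is a rough sketch of how the two files could be generated from code. It only builds and writes the text; it does not call Tesseract at all. The helper names (buildDictionaryConfig, buildWordList, writeTextFile) and the "user-words" suffix are my own choices for illustration; only the three parameter names come from the man page.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: text of a config file that switches off the built-in
// dictionaries and names a user word list (<lang>.<suffix> in tessdata).
std::string buildDictionaryConfig(const std::string& userWordsSuffix) {
    std::string config;
    config += "load_system_dawg F\n";  // skip the main system dictionary
    config += "load_freq_dawg F\n";    // skip the frequent-words dictionary
    config += "user_words_suffix " + userWordsSuffix + "\n";
    return config;
}

// Hypothetical helper: one word per line, the layout of a user word list.
std::string buildWordList(const std::vector<std::string>& words) {
    std::string list;
    for (const std::string& word : words) {
        list += word;
        list += '\n';
    }
    return list;
}

// Writes text to path; returns false if the file could not be written.
bool writeTextFile(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out << text;
    out.flush();
    return out.good();
}
```

With "eng" as the language you would write buildWordList(...) to eng.user-words inside tessdata, and buildDictionaryConfig("user-words") to a config file of your choosing. As far as I know, the load_*_dawg parameters are init-time parameters, so the config has to be handed to Init() rather than applied with SetVariable() afterwards; the comments in baseapi.h say SetVariable() only works for non-init variables. Also note this only biases recognition toward your word list, so non-dictionary strings may still come through and you may need to filter the result yourself.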
