How to extract text from an image with Tika and Tesseract OCR

Hello, I am trying to extract the text content of an image using Tesseract with Tika:

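import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;

// "stream" is the InputStream of the image to parse.
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();

// Point Tika at the directory that contains the tesseract executable.
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath("/usr/local/bin/");
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);

try {
    parser.parse(stream, handler, metadata, parseContext);
} finally {
    stream.close();
}

// The handler collects the extracted text.
System.out.println(handler.toString());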

      

I always get blank output. I'm not sure how to get the content from the parser. Can anyone help me?



1 answer


You don't need to call config.setTesseractPath("/usr/local/bin/") if tesseract is already on your system PATH. Check that first, for example:

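import java.io.File;

public static boolean checkIfExecutableInPath(String exec) {
    String path = System.getenv("PATH");
    if (path == null || path.trim().isEmpty()) {
        return false;
    }
    // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
    for (String dir : path.split(File.pathSeparator)) {
        File candidate = new File(dir, exec);
        // Accept only regular files that the current user may execute.
        if (candidate.isFile() && candidate.canExecute()) {
            return true;
        }
    }
    return false;
}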

      



Then add this to your code, where pathToTesseractDir is the directory that contains the tesseract executable:

if (!checkIfExecutableInPath("tesseract")) {
    config.setTesseractPath(pathToTesseractDir);
}
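
If you also want to know which directory the executable was found in (for logging, or to pass to config.setTesseractPath), the same PATH check can return the directory instead of a boolean. This is only a sketch using the standard library: TesseractLocator and findDirectory are hypothetical names, not part of Tika, and the ".exe" fallback is an assumption for Windows installs.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Tika.
public class TesseractLocator {

    /**
     * Returns the first directory listed in pathValue that contains an
     * executable file named exec (or exec + ".exe"), or empty if none does.
     * pathValue uses the platform separator, like the PATH variable.
     */
    public static Optional<Path> findDirectory(String exec, String pathValue) {
        if (pathValue == null || pathValue.trim().isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as "a::b"
            }
            for (String name : new String[] {exec, exec + ".exe"}) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(Paths.get(dir));
                    }
                } catch (InvalidPathException e) {
                    // Malformed PATH entry: ignore it and keep looking.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> dir = findDirectory("tesseract", System.getenv("PATH"));
        System.out.println(dir
                .map(d -> "tesseract found in " + d)
                .orElse("tesseract is not on PATH; set the path explicitly"));
    }
}
```

If the lookup comes back empty, Tika cannot find tesseract on its own, so you have to call config.setTesseractPath yourself with the correct directory.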

      
