How to extract values from Tika OCR using Tesseract
Hello, I am trying to extract text content from an image using Tesseract with Tika:
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath("/usr/local/bin/");
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
try {
    parser.parse(stream, handler, metadata, parseContext);
} finally {
    stream.close();
}
System.out.println(handler.toString());
I always get blank output. I'm not sure how to get the content from the parser. Can anyone help me?
1 answer
You don't need to call config.setTesseractPath("/usr/local/bin/") if tesseract is already on your system path. Check that first, for example:
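import java.io.File;
import org.apache.commons.lang3.StringUtils; // assuming Apache Commons Lang 3

// Returns true if a file named exec exists in any directory listed in PATH.
public static boolean checkIfExecutableInPath(String exec) {
    String path = System.getenv("PATH");
    if (StringUtils.isNotBlank(path)) {
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows
        for (String dir : path.split(File.pathSeparator)) {
            if (new File(dir, exec).exists()) {
                return true;
            }
        }
    }
    return false;
}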
Then add this to your code, where pathToTesseractDir is the directory that contains the tesseract binary:
if (!checkIfExecutableInPath("tesseract")) {
    config.setTesseractPath(pathToTesseractDir);
}
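If you would rather not depend on StringUtils, the same PATH check can be sketched with the standard library only. This is a minimal variant and not part of Tika: the class name ExecutableFinder and the method findExecutableDir are made up for illustration, and trying an extra ".exe" suffix is an assumption meant to cover Windows installs. It returns the directory where the executable was found, which may help when you do need a value for setTesseractPath.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika or Tesseract.
class ExecutableFinder {

    // Returns the first directory listed in pathValue that contains an
    // executable regular file named exec (or exec + ".exe"), or null if
    // no such directory is found.
    static String findExecutableDir(String pathValue, String exec) {
        if (pathValue == null || pathValue.trim().isEmpty()) {
            return null;
        }
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            // Assumption: on Windows the binary carries an ".exe" suffix
            for (String name : new String[] {exec, exec + ".exe"}) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return dir;
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String dir = findExecutableDir(System.getenv("PATH"), "tesseract");
        System.out.println(dir == null
                ? "tesseract not found on PATH"
                : "tesseract found in " + dir);
    }
}
```

Passing the PATH value in as a parameter, instead of reading the environment inside the method, keeps the lookup easy to try against a directory list of your own.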