Posted to dev@tika.apache.org by "Horst Krause (JIRA)" <ji...@apache.org> on 2019/03/25 07:00:01 UTC
[jira] [Created] (TIKA-2844) OCR_STRATEGY.OCR_ONLY does not extract any text
Horst Krause created TIKA-2844:
----------------------------------
Summary: OCR_STRATEGY.OCR_ONLY does not extract any text
Key: TIKA-2844
URL: https://issues.apache.org/jira/browse/TIKA-2844
Project: Tika
Issue Type: Bug
Components: ocr
Affects Versions: 1.20
Environment: Win7, 64-bit, Tesseract 4.1.0 and ImageMagick 7.0.8 installed
Reporter: Horst Krause
I have some PDFs that were scanned and OCRed with other software, but the quality of the recognized text is quite poor. I would therefore like to ignore the existing text in the PDF and run a new OCR pass with Tesseract.
So I use OCR_STRATEGY.OCR_ONLY. Unfortunately, this does not extract any text from the PDF. When I use OCR_AND_TEXT_EXTRACTION instead, I get the poor text from the original PDF.
After trying several tutorials and examples, this is my code:
{code:java}
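import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.parser.pdf.PDFParserConfig.OCR_STRATEGY;
import org.apache.tika.sax.BodyContentHandler;

final InputStream pdf = Files.newInputStream(Paths.get("e:/path/to/my.pdf"));
final ByteArrayOutputStream out = new ByteArrayOutputStream();
final TikaConfig config = TikaConfig.getDefaultConfig();
final String version = (new Tika(config)).toString();
// LOG is the surrounding class's logger (declaration not shown)
LOG.info("Tika version " + version + " / " + config.getParser().getClass().getName());
final BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
// PDF settings: ignore the embedded text and OCR the pages instead
final PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY);
// Tesseract settings: both paths point to the installation directories
final TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
tesserConfig.setTesseractPath("c:/Progra~1/TESSER~1");
tesserConfig.setImageMagickPath("C:/Progra~1/IMAGEM~1.8-Q");
tesserConfig.setEnableImageProcessing(1);
final Parser parser = new AutoDetectParser();
final Metadata meta = new Metadata();
// Register the parser and both configs so they are used during parsing
final ParseContext parsecontext = new ParseContext();
parsecontext.set(Parser.class, parser);
parsecontext.set(PDFParserConfig.class, pdfConfig);
parsecontext.set(TesseractOCRConfig.class, tesserConfig);
parser.parse(pdf, handler, meta, parsecontext);
System.out.println("OCR Result: " + handler.toString());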
{code}
As there is no error message or stack trace at all, I don't understand why I don't get any result. If this is not a bug, Tika should at least output some hint about what is going wrong.
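Since nothing is logged, one thing that may be worth ruling out first is whether the configured directories actually contain the external executables. The following is only a minimal sketch using the plain JDK, not Tika code: the class `ExecutableCheck` and its methods are hypothetical helpers, and the executable names `tesseract` and `convert` are assumptions about what Tika looks for in the directories passed to setTesseractPath and setImageMagickPath.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: checks whether a directory
// contains an executable with the given base name.
class ExecutableCheck {

    // File names to look for: the bare name (Unix-like) and name + ".exe" (Windows).
    static List<String> candidateNames(String baseName) {
        return Arrays.asList(baseName, baseName + ".exe");
    }

    // True if any candidate name exists as a regular file directly inside dir.
    static boolean hasExecutable(Path dir, String baseName) {
        for (String name : candidateNames(baseName)) {
            if (Files.isRegularFile(dir.resolve(name))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Directories copied from the report; adjust to the local installation.
        Path tesseractDir = Paths.get("c:/Progra~1/TESSER~1");
        Path imageMagickDir = Paths.get("C:/Progra~1/IMAGEM~1.8-Q");
        // Assumption: the executables are named "tesseract" and "convert".
        System.out.println("tesseract found: " + hasExecutable(tesseractDir, "tesseract"));
        System.out.println("convert found:   " + hasExecutable(imageMagickDir, "convert"));
    }
}
```

If either line prints false, the silent empty result could simply mean that the external tool was never started. If both print true, the paths are probably not the cause and the problem lies elsewhere.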