Posted to dev@tika.apache.org by "Horst Krause (JIRA)" <ji...@apache.org> on 2019/03/25 07:00:01 UTC

[jira] [Created] (TIKA-2844) OCR_STRATEGY.OCR_ONLY does not extract any text

Horst Krause created TIKA-2844:
----------------------------------

             Summary: OCR_STRATEGY.OCR_ONLY does not extract any text
                 Key: TIKA-2844
                 URL: https://issues.apache.org/jira/browse/TIKA-2844
             Project: Tika
          Issue Type: Bug
          Components: ocr
    Affects Versions: 1.20
         Environment: Win7, 64-bit, Tesseract 4.1.0 and ImageMagick 7.0.8 installed
            Reporter: Horst Krause


I have some PDFs that were scanned and OCRed by other software, but the quality of the recognized text is quite poor. I would therefore like to ignore the existing text in the PDF and run a fresh OCR pass with Tesseract.

So I use OCR_STRATEGY.OCR_ONLY. Unfortunately, this does not extract any text from the PDF. When I use OCR_AND_TEXT_EXTRACTION, I get the poor text from the original PDF.

After trying several tutorials and examples, this is my code:
{code:java}
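import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.parser.pdf.PDFParserConfig.OCR_STRATEGY;
import org.apache.tika.sax.BodyContentHandler;

// Method body; LOG is a logger defined elsewhere in the class.
final InputStream pdf = Files.newInputStream(Paths.get("e:/path/to/my.pdf"));

// Log the Tika version and the default parser class in use.
final TikaConfig config = TikaConfig.getDefaultConfig();
final String version = (new Tika(config)).toString();
LOG.info("Tika version " + version + " / " + config.getParser().getClass().getName());

// Collect the extracted body text without a practical size limit.
final BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

// Ignore the embedded text layer and rely on OCR only.
final PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY);

// Point Tika at the local Tesseract and ImageMagick installations (8.3 short paths).
final TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
tesserConfig.setTesseractPath("c:/Progra~1/TESSER~1");
tesserConfig.setImageMagickPath("C:/Progra~1/IMAGEM~1.8-Q");
tesserConfig.setEnableImageProcessing(1);

final Parser parser = new AutoDetectParser();
final Metadata meta = new Metadata();
final ParseContext parsecontext = new ParseContext();

// Register the parser and both configs so embedded content is handled with them.
parsecontext.set(Parser.class, parser);
parsecontext.set(PDFParserConfig.class, pdfConfig);
parsecontext.set(TesseractOCRConfig.class, tesserConfig);

parser.parse(pdf, handler, meta, parsecontext);
System.out.println("OCR Result: " + handler.toString());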
{code}
As there is no error message or stack trace at all, I don't understand why I don't get any result. Even if this is not a bug, Tika should at least output some hint about what is going wrong.
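
Since nothing is reported, one thing that can be ruled out independently of Tika is whether the Tesseract executable is actually reachable at the configured path. The following is only a minimal sketch of such a check. It assumes the binary is named tesseract.exe (or tesseract) and sits directly in the configured directory. The class and method names are made up for illustration and are not part of Tika.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: checks a Tesseract install directory.
final class TesseractCheck {

    /** Looks for an executable file named tesseract.exe or tesseract in dir. */
    static Optional<Path> findBinary(String dir) {
        Path base = Paths.get(dir);
        for (String name : new String[] {"tesseract.exe", "tesseract"}) {
            Path candidate = base.resolve(name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    /** Runs "<binary> --version" and reports whether it exits with code 0. */
    static boolean runsCleanly(Path binary) {
        try {
            Process p = new ProcessBuilder(binary.toString(), "--version")
                    .inheritIO() // show Tesseract's own output on the console
                    .start();
            if (!p.waitFor(10, TimeUnit.SECONDS)) {
                p.destroyForcibly(); // give up if it hangs
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            return false; // binary missing or not startable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Same directory as passed to setTesseractPath in the report above.
        String dir = args.length > 0 ? args[0] : "c:/Progra~1/TESSER~1";
        Optional<Path> binary = findBinary(dir);
        if (!binary.isPresent()) {
            System.out.println("No tesseract executable found in " + dir);
        } else {
            System.out.println("Found " + binary.get()
                    + ", starts cleanly: " + runsCleanly(binary.get()));
        }
    }
}
```

If this check passes, the path setting is probably not the cause. A failure would at least explain empty OCR output without any exception.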



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)