Posted to dev@tika.apache.org by "Horst Krause (JIRA)" <ji...@apache.org> on 2019/03/28 21:19:00 UTC
[jira] [Closed] (TIKA-2844) OCR_STRATEGY.OCR_ONLY does not extract any text

     [ https://issues.apache.org/jira/browse/TIKA-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Horst Krause closed TIKA-2844.
------------------------------

> OCR_STRATEGY.OCR_ONLY does not extract any text
> -----------------------------------------------
>
>                 Key: TIKA-2844
>                 URL: https://issues.apache.org/jira/browse/TIKA-2844
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.20
>         Environment: Win7, 64-bit, Tesseract 4.1.0 and ImageMagick 7.0.8 installed
>            Reporter: Horst Krause
>            Priority: Major
>
> I have some PDFs that were scanned and OCRed by other software, but the quality of the recognized text is quite poor. I would therefore like to ignore the text in the PDF and run a new OCR pass with Tesseract.
> So I use OCR_STRATEGY.OCR_ONLY. Unfortunately, this does not extract any text from the PDF.
> When I use OCR_AND_TEXT_EXTRACTION, I get the poor text from the original PDF.
> I called the tesseract binary from the console, and there the expected text was extracted.
> After trying several tutorials and examples, this is my code:
> {code:java}
> final InputStream pdf = Files.newInputStream(Paths.get("e:/path/to/my.pdf"));
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> final TikaConfig config = TikaConfig.getDefaultConfig();
> final String version = (new Tika(config)).toString();
> LOG.info("Tika version " + version + " / " + config.getParser().getClass().getName());
> final BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> final PDFParserConfig pdfConfig = new PDFParserConfig();
> pdfConfig.setExtractInlineImages(true);
> pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY);
> final TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
> tesserConfig.setTesseractPath("c:/Progra~1/TESSER~1");
> tesserConfig.setImageMagickPath("C:/Progra~1/IMAGEM~1.8-Q");
> tesserConfig.setEnableImageProcessing(1);
> final Parser parser = new AutoDetectParser();
> final Metadata meta = new Metadata();
> final ParseContext parsecontext = new ParseContext();
> parsecontext.set(Parser.class, parser);
> parsecontext.set(PDFParserConfig.class, pdfConfig);
> parsecontext.set(TesseractOCRConfig.class, tesserConfig);
> parser.parse(pdf, handler, meta, parsecontext);
> System.out.println("OCR Result: " + handler.toString());
> {code}
> My maven dependencies:
> {code:xml}
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>1.20</version> <!-- 1.20 -->
> </dependency>
> <dependency>
>   <groupId>com.levigo.jbig2</groupId>
>   <artifactId>levigo-jbig2-imageio</artifactId>
>   <version>1.6.5</version>
> </dependency>
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-core</artifactId>
>   <version>1.3.1</version> <!-- 1.4.0 -->
> </dependency>
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-jpeg2000</artifactId>
>   <version>1.3.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.pdfbox</groupId>
>   <artifactId>jbig2-imageio</artifactId>
>   <version>3.0.0</version>
> </dependency>
> {code}
>  
> As there is no error message or stack trace at all, I don't understand why I don't get any result. Even if this is not a bug, Tika should at least output some hint about what is going wrong.
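
Since the report mentions no error output at all, one cheap sanity check is whether the directory handed to setTesseractPath really contains the Tesseract executable as seen from the JVM. The sketch below uses only the Java standard library. The class TesseractPathCheck and the method hasTesseractBinary are hypothetical names made up for illustration and are not part of Tika. Whether a wrong path is the cause of this particular issue is an assumption, not something the report confirms.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TesseractPathCheck {

    /**
     * Hypothetical helper (not a Tika API): returns true if the given
     * directory exists and contains a file named tesseract.exe or tesseract.
     */
    public static boolean hasTesseractBinary(Path dir) {
        if (dir == null || !Files.isDirectory(dir)) {
            return false;
        }
        // Windows installs ship tesseract.exe; Unix-like systems use plain "tesseract".
        return Files.isRegularFile(dir.resolve("tesseract.exe"))
                || Files.isRegularFile(dir.resolve("tesseract"));
    }

    public static void main(String[] args) {
        // Defaults to the directory used in the report; pass another one as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "c:/Progra~1/TESSER~1");
        System.out.println(dir + " -> tesseract binary found: " + hasTesseractBinary(dir));
    }
}
```

If this prints false for the configured directory, the path itself (for example the 8.3 short name) would be the first thing to rule out before looking at the OCR strategy.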



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)