Posted to dev@tika.apache.org by "Horst Krause (JIRA)" <ji...@apache.org> on 2019/03/25 07:11:00 UTC
[jira] [Commented] (TIKA-2844) OCR_STRATEGY.OCR_ONLY does not extract any text

    [ https://issues.apache.org/jira/browse/TIKA-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800443#comment-16800443 ] 

Horst Krause commented on TIKA-2844:
------------------------------------

The parse reports the following metadata at the end (the contents of meta). It may help in analysing the issue.
{code:java}
date=2019-03-15T10:23:51Z
pdf:PDFVersion=1.4
access_permission:modify_annotations=true
access_permission:can_print_degraded=true
dcterms:created=2019-03-15T10:23:51Z
Last-Modified=2019-03-15T10:23:51Z
dcterms:modified=2019-03-15T10:23:51Z
dc:format=application/pdf; version=1.4
xmpMM:DocumentID=uuid:30AB12E7-44E3-4011-8FA3-77A58A1BC349
Last-Save-Date=2019-03-15T10:23:51Z
access_permission:fill_in_form=true
pdf:docinfo:modified=2019-03-15T10:23:51Z
meta:save-date=2019-03-15T10:23:51Z
pdf:encrypted=false
modified=2019-03-15T10:23:51Z
Content-Type=application/pdf
X-Parsed-By=org.apache.tika.parser.DefaultParser
X-Parsed-By=org.apache.tika.parser.pdf.PDFParser
X-Parsed-By=class org.apache.tika.parser.ocr.TesseractOCRParser
meta:creation-date=2019-03-15T10:23:51Z
created=2019-03-15T10:23:51Z
access_permission:extract_for_accessibility=true
access_permission:assemble_document=true
xmpTPg:NPages=1
Creation-Date=2019-03-15T10:23:51Z
access_permission:extract_content=true
access_permission:can_print=true
producer=EPSON Scan
access_permission:can_modify=true
pdf:docinfo:producer=EPSON Scan
pdf:docinfo:created=2019-03-15T10:23:51Z
{code}
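
A dump like the one above can be inspected programmatically, for example to see which parsers appear under X-Parsed-By. The following is only a minimal sketch using the Java standard library; the class MetadataDump and its methods are hypothetical helpers for illustration and are not part of Tika.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: groups "key=value" lines from a
// metadata dump so that repeated keys such as X-Parsed-By keep all values.
class MetadataDump {

    // Splits the dump into lines and collects every value per key, in order.
    static Map<String, List<String>> parse(String dump) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String line : dump.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // skip blank lines and lines without a key
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            result.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return result;
    }

    // True if any X-Parsed-By value contains the given parser class name.
    static boolean parsedBy(Map<String, List<String>> meta, String parserName) {
        List<String> parsers = meta.getOrDefault("X-Parsed-By", Collections.emptyList());
        for (String value : parsers) {
            if (value.contains(parserName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A few lines copied from the dump above.
        String dump = String.join("\n",
                "Content-Type=application/pdf",
                "X-Parsed-By=org.apache.tika.parser.DefaultParser",
                "X-Parsed-By=org.apache.tika.parser.pdf.PDFParser",
                "X-Parsed-By=class org.apache.tika.parser.ocr.TesseractOCRParser",
                "xmpTPg:NPages=1");
        Map<String, List<String>> meta = parse(dump);
        System.out.println("Parsers: " + meta.get("X-Parsed-By"));
        System.out.println("OCR parser listed: " + parsedBy(meta, "TesseractOCRParser"));
    }
}
```

In this dump the OCR parser is listed, which suggests it was at least invoked for the document; it does not by itself show that Tesseract produced any text.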

> OCR_STRATEGY.OCR_ONLY does not extract any text
> -----------------------------------------------
>
>                 Key: TIKA-2844
>                 URL: https://issues.apache.org/jira/browse/TIKA-2844
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.20
>         Environment: Win7, 64-bit, Tesseract 4.1.0 and ImageMagick 7.0.8 installed
>            Reporter: Horst Krause
>            Priority: Major
>
> I have some PDFs that were scanned, including OCR, with other software, but the quality of the recognized text is quite poor. I would therefore like to ignore the text in the PDF and run a fresh OCR pass with Tesseract.
> So I use OCR_STRATEGY.OCR_ONLY. Unfortunately, this does not extract any text from the PDF. When I use OCR_AND_TEXT_EXTRACTION, I get the poor text from the original PDF.
> After trying several tutorials and examples, this is my code:
> {code:java}
> final InputStream pdf = Files.newInputStream(Paths.get("e:/path/to/my.pdf"));
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> final TikaConfig config = TikaConfig.getDefaultConfig();
> final String version = (new Tika(config)).toString();
> LOG.info("Tika version " + version + " / " + config.getParser().getClass().getName());
> final BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> final PDFParserConfig pdfConfig = new PDFParserConfig();
> pdfConfig.setExtractInlineImages(true);
> pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY);
> final TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
> tesserConfig.setTesseractPath("c:/Progra~1/TESSER~1");
> tesserConfig.setImageMagickPath("C:/Progra~1/IMAGEM~1.8-Q");
> tesserConfig.setEnableImageProcessing(1);
> final Parser parser = new AutoDetectParser();
> final Metadata meta = new Metadata();
> final ParseContext parsecontext = new ParseContext();
> parsecontext.set(Parser.class, parser);
> parsecontext.set(PDFParserConfig.class, pdfConfig);
> parsecontext.set(TesseractOCRConfig.class, tesserConfig);
> parser.parse(pdf, handler, meta, parsecontext);
> System.out.println("OCR Result: " + handler.toString());
> {code}
> My maven dependencies:
> {code:xml}
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>1.20</version> <!-- 1.20 -->
> </dependency>
> <dependency>
>   <groupId>com.levigo.jbig2</groupId>
>   <artifactId>levigo-jbig2-imageio</artifactId>
>   <version>1.6.5</version>
> </dependency>
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-core</artifactId>
>   <version>1.3.1</version> <!-- 1.4.0 -->
> </dependency>
> <dependency>
>   <groupId>com.github.jai-imageio</groupId>
>   <artifactId>jai-imageio-jpeg2000</artifactId>
>   <version>1.3.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.pdfbox</groupId>
>   <artifactId>jbig2-imageio</artifactId>
>   <version>3.0.0</version>
> </dependency>
> {code}
>  
> As there is no error message or stack trace at all, I don't understand why I don't get any result. If this is not a bug, Tika should at least output some hint about what is going wrong.
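
Since the report mentions that no error is shown, one way to rule out a path problem is to check outside Tika whether a Tesseract executable can actually be started from the configured directory. The sketch below uses only the Java standard library; the class TesseractCheck and its methods are hypothetical, and the assumption that the directory contains a file named tesseract (or tesseract.exe on Windows) is mine, not something confirmed in this issue. It does not reproduce how Tika itself locates Tesseract.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

// Hypothetical diagnostic, not part of Tika: checks whether a Tesseract
// executable in the given directory can be started with "--version".
class TesseractCheck {

    // Assumed executable name: tesseract.exe on Windows, tesseract elsewhere.
    static Path executableIn(String dir) {
        boolean windows = System.getProperty("os.name", "")
                .toLowerCase().startsWith("windows");
        return Paths.get(dir, windows ? "tesseract.exe" : "tesseract");
    }

    // Returns true only if the file exists and exits with code 0.
    static boolean canRun(String dir) {
        Path exe = executableIn(dir);
        if (!Files.isRegularFile(exe)) {
            return false;
        }
        try {
            Process p = new ProcessBuilder(exe.toString(), "--version")
                    .redirectErrorStream(true)
                    .start();
            // Drain the (small) output so the child process cannot block on it.
            try (InputStream in = p.getInputStream()) {
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) {
                    // output is discarded
                }
            }
            if (!p.waitFor(10, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Default is the directory used in the code from the issue description.
        String dir = args.length > 0 ? args[0] : "c:/Progra~1/TESSER~1";
        System.out.println("Looking for " + executableIn(dir));
        System.out.println("Tesseract runnable: " + canRun(dir));
    }
}
```

If this reports false for the directory passed to setTesseractPath, the path is worth fixing before looking further at the OCR strategy; if it reports true, the cause lies elsewhere.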


