Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2014/05/29 20:09:02 UTC

[jira] [Updated] (TIKA-93) OCR support

     [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-93:
--------------------------------

    Attachment: TesseractOCR_Tyler.patch

Awesome! I attached another patch, which includes TesseractOCRParser.patch plus unit tests for the parser (PDF, PPTX, and DOCX files containing embedded images with text). We could use more tests for images with no text, blurry text, and so on, but I don't know how accurate Tesseract is.

Steps to apply this patch: install Tesseract [1], apply the patch, and move the test files into tika-parsers/src/test/resources/test-documents/ocr. Then run the tests with {{mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false}}.

What needs to happen from here? How should we include Tesseract in the sources? How should we handle timeouts (for example, warn the user that OCR can be slow or has timed out)?
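On the timeout question, here is a rough sketch of one possible approach. It is not what the attached patch does. It assumes Java 8+ for {{Process.waitFor(long, TimeUnit)}} and a {{tesseract}} binary on the PATH, and the class and method names ({{OcrTimeoutSketch}}, {{buildCommand}}, {{runWithTimeout}}) are made up for illustration. The idea is to run Tesseract as an external process, kill it if it exceeds a limit, and print a warning instead of failing the whole parse.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika or of the attached patch.
class OcrTimeoutSketch {

    // Builds the usual Tesseract command line: tesseract <image> <outputbase> -l <lang>.
    // Tesseract writes the recognized text to <outputbase>.txt.
    static List<String> buildCommand(String image, String outputBase, String lang) {
        return Arrays.asList("tesseract", image, outputBase, "-l", lang);
    }

    // Runs the command and waits at most timeoutSeconds for it to finish.
    // Returns true only if the process exited with status 0 within the limit.
    static boolean runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let the child write to our stdout/stderr so it cannot block on a full pipe.
        builder.inheritIO();
        Process process = builder.start();

        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // Still running after the limit: kill it and warn rather than throw.
            process.destroyForcibly();
            System.err.println("Warning: OCR timed out after " + timeoutSeconds
                    + " s; image text was skipped");
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: OcrTimeoutSketch <image> <outputbase>");
            return;
        }
        boolean ok = runWithTimeout(buildCommand(args[0], args[1], "eng"), 120);
        System.out.println(ok ? "OCR finished" : "OCR failed or timed out");
    }
}
```

The 120 second limit and the "eng" language are placeholders; in practice both would probably need to be configurable by the user.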

[1] - https://code.google.com/p/tesseract-ocr/wiki/ReadMe

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)