Posted to dev@tika.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2014/02/07 19:00:28 UTC

[jira] [Commented] (TIKA-93) OCR support

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894779#comment-13894779 ] 

Grant Ingersoll commented on TIKA-93:
-------------------------------------

I'm noodling around with producing a patch for this and have a few questions for the group:

# Where in Tika do people usually put these kinds of "downstream" tasks?  Presumably we would need to work with the MIME type detection process to know that the input is binary and potentially OCR-able.  I imagine we would want something that sits between Detection and Parsing.  I'd also suggest we make it pluggable, so that we can support other OCR solutions.
# Is anyone aware of anything in PDFBox that lets you tell whether a document is an image-based PDF?
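
To make the first question concrete, here is a rough sketch of what a pluggable hook between detection and parsing might look like. Everything in it is hypothetical: OcrEngine, OcrDispatcher and ImageOnlyEngine are illustrative names, not existing Tika classes, and the media type is passed as a plain string rather than whatever type the detector actually returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical plug-in point; this interface does not exist in Tika.
interface OcrEngine {
    /** True if this engine is willing to attempt OCR on the detected media type. */
    boolean supports(String mediaType);

    /** Returns the text recognized in the raw document bytes. */
    String recognize(byte[] data) throws Exception;
}

// Would sit between detection and parsing: picks the first registered
// engine that accepts the detected media type, if any.
final class OcrDispatcher {
    private final List<OcrEngine> engines = new ArrayList<>();

    void register(OcrEngine engine) {
        engines.add(engine);
    }

    Optional<OcrEngine> select(String detectedType) {
        for (OcrEngine engine : engines) {
            if (engine.supports(detectedType)) {
                return Optional.of(engine);
            }
        }
        return Optional.empty();
    }
}

// Stand-in engine for illustration: accepts any image/* type and
// returns no text. A real engine would delegate to an OCR tool.
final class ImageOnlyEngine implements OcrEngine {
    @Override
    public boolean supports(String mediaType) {
        return mediaType != null && mediaType.startsWith("image/");
    }

    @Override
    public String recognize(byte[] data) {
        return "";
    }
}
```

With something like this, supporting another OCR solution would only mean registering another OcrEngine. Deciding whether a PDF is image based (the second question) would live inside that engine's supports() check or in a step before it.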





> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.
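
For illustration only, invoking a command-line tool such as Tesseract from Java could look roughly like the sketch below. It assumes a tesseract binary is on the PATH and that it is called as "tesseract <image> <outputbase> -l <lang>", writing its result to <outputbase>.txt. The TesseractCli class and its method names are made up for this example and are not part of Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Tika class.
final class TesseractCli {

    /** Builds the argument list: tesseract <image> <outputbase> -l <lang>. */
    static List<String> command(Path image, Path outputBase, String lang) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(), "-l", lang);
    }

    /** Runs tesseract on the image and returns the recognized text. */
    static String run(Path image, String lang) throws IOException, InterruptedException {
        // Tesseract appends ".txt" to the output base name it is given.
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = outputBase.resolveSibling(outputBase.getFileName() + ".txt");
        try {
            Process process = new ProcessBuilder(command(image, outputBase, lang))
                    .inheritIO() // avoid blocking on an unread output pipe
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

A real integration would also need a timeout and a graceful fallback when the binary is not installed.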


