Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2008/11/28 02:40:44 UTC

[jira] Commented: (TIKA-93) OCR support

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651448#action_12651448 ] 

Jukka Zitting commented on TIKA-93:
-----------------------------------

OCRopus (http://code.google.com/p/ocropus/) seems like a nice tool for this. It's a command line tool, so we'd need something like the ExternalParser class to invoke it, but the annotated HTML output it generates is already very close to what Tika uses, so the integration should be easy.
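
As a rough sketch of what invoking a command line OCR tool could look like: the code below does not use Tika's ExternalParser API. It uses plain ProcessBuilder and the Tesseract invocation "tesseract <image> <outputbase>", which writes the recognized text to <outputbase>.txt. It assumes a tesseract binary is on the PATH. The class and method names are hypothetical and only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika.
class ExternalOcrSketch {

    // Builds the command line "tesseract <image> <outputbase>".
    // Assumption: the "tesseract" binary is available on the PATH.
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    // Runs the external tool and returns the text it recognized.
    static String extractText(Path image) throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr-sketch");
        Path outputBase = workDir.resolve("result");
        // Tesseract appends ".txt" to the output base name.
        Path textFile = workDir.resolve("result.txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .inheritIO() // pass the tool's own messages through to our console
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("OCR tool exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Best-effort cleanup of the temporary output.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(workDir);
        }
    }
}
```

Something like ExternalOcrSketch.extractText(Paths.get("scan.png")) would then return the recognized text. A real integration would still need to map that output to the XHTML events Tika produces and handle the case where the tool is not installed.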

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.
