
Posted to dev@tika.apache.org by "frank (JIRA)" <ji...@apache.org> on 2013/12/24 08:48:53 UTC

[jira] [Commented] (TIKA-93) OCR support

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856214#comment-13856214 ] 

frank commented on TIKA-93:
---------------------------

This feature is really useful and helpful.

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.
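
To make the idea in the issue description concrete, here is a minimal sketch of invoking the Tesseract command line tool from Java. It is not Tika's implementation. The class name TesseractSketch and the methods buildCommand and extractText are hypothetical, and the sketch assumes a "tesseract" binary on the PATH that accepts the usual "tesseract <image> <outputbase> -l <lang>" form and writes its result to <outputbase>.txt.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
final class TesseractSketch {

    // Builds the argument list: tesseract <image> <outputBase> -l <lang>
    static List<String> buildCommand(Path image, Path outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");            // assumption: binary is on the PATH
        cmd.add(image.toString());
        cmd.add(outputBase.toString());  // tesseract appends ".txt" itself
        cmd.add("-l");
        cmd.add(lang);                   // e.g. "eng"; needs matching trained data
        return cmd;
    }

    // Runs tesseract on one image and returns the recognized text.
    // Throws IOException if the binary cannot be started or exits with an error.
    static String extractText(Path image, String lang)
            throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("ocr-sketch");
        Path outputBase = dir.resolve("out");

        Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                .inheritIO()             // pass tesseract's console output through
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        // Read the text file that tesseract wrote next to the output base.
        byte[] bytes = Files.readAllBytes(dir.resolve("out.txt"));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

A caller would use something like TesseractSketch.extractText(Paths.get("scan.png"), "eng"). Shelling out keeps native code out of the JVM, but it still requires the tool and its language data to be installed on the host.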



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Re: [jira] [Commented] (TIKA-93) OCR support

Posted by Oleg Tikhonov <ol...@apache.org>.

Hi Frank,

It's not so easy, especially with a dependency on native libraries.
It also depends on "trained" profiles, languages, and fonts.

The questions are: which platforms do we want to support, and which
languages and fonts?

BR,
Oleg

