Posted to dev@tika.apache.org by "Pei Chen (JIRA)" <ji...@apache.org> on 2012/11/07 01:52:13 UTC

[jira] [Commented] (TIKA-93) OCR support

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491996#comment-13491996 ] 

Pei Chen commented on TIKA-93:
------------------------------

Have you seen JavaOCR (pure Java OCR, BSD licensed)? http://sourceforge.net/projects/javaocr/
I have not tried it myself yet (it looks like 1.0 was released about a week ago).
I think a pure Java implementation may be easier than forking another process (exec'ing a C++ binary) or introducing JNI dependencies.
If there is interest, I could give it a whirl the next chance I get...
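
For comparison, the process-forking alternative mentioned above could look roughly like the sketch below. It assumes a `tesseract` binary is available on the PATH and relies on Tesseract's basic command form, `tesseract <image> <outputbase>`, which writes the recognized text to `<outputbase>.txt`. The class `TesseractExec` and its methods are hypothetical names used for illustration only; they are not part of Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika): runs an external OCR tool as a child process.
final class TesseractExec {

    // Builds the argument list for: <executable> <image> <outputbase>
    static List<String> buildCommand(String executable, Path image, Path outputBase) {
        return Arrays.asList(executable, image.toString(), outputBase.toString());
    }

    // Tesseract appends ".txt" to the output base name it is given.
    static Path textFileFor(Path outputBase) {
        return Paths.get(outputBase.toString() + ".txt");
    }

    // Forks the OCR process, waits for it, and returns the recognized text.
    static String ocr(String executable, Path image) throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("ocr");
        Path outputBase = dir.resolve("result");
        Path textFile = textFileFor(outputBase);
        try {
            Process process = new ProcessBuilder(buildCommand(executable, image, outputBase))
                    .inheritIO() // forward the tool's stdout/stderr so the child cannot block on a full pipe
                    .start();
            int exit = process.waitFor();
            if (exit != 0) {
                throw new IOException("OCR process exited with status " + exit);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up the temporary output file and directory.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(dir);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TesseractExec <image>");
            return;
        }
        // Assumption: the executable is named "tesseract" and is on the PATH.
        System.out.print(ocr("tesseract", Paths.get(args[0])));
    }
}
```

The cost of this route is visible even in a sketch: the tool has to be installed separately, and the caller has to manage temporary files, exit codes, and process I/O, none of which a pure Java library would need.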

                
> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Commented] (TIKA-93) OCR support

Posted by Oleg Tikhonov <ol...@apache.org>.

Hey,
I tried to look through the distribution but could not find the sources;
among the binaries, they provide only a Nokia distribution.

It would be nice if you could play with it and share your impressions.

BR,
Oleg


On Wed, Nov 7, 2012 at 2:52 AM, Pei Chen (JIRA) <ji...@apache.org> wrote:

>
>     [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491996#comment-13491996 ]
>
> Pei Chen commented on TIKA-93:
> ------------------------------
>
> Have you seen JavaOCR (pure Java OCR, BSD licensed)?
> http://sourceforge.net/projects/javaocr/
> I have not tried it myself yet (it looks like 1.0 was released about
> a week ago).
> I think a pure Java implementation may be easier than forking another
> process (exec'ing a C++ binary) or introducing JNI dependencies.
> If there is interest, I could give it a whirl the next chance I get...
>
>
> > OCR support
> > -----------
> >
> >                 Key: TIKA-93
> >                 URL: https://issues.apache.org/jira/browse/TIKA-93
> >             Project: Tika
> >          Issue Type: New Feature
> >          Components: parser
> >            Reporter: Jukka Zitting
> >            Priority: Minor
> >
> > I don't know of any decent open source pure Java OCR libraries, but
> > there are command line OCR tools like Tesseract
> > (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> > extract text content (where available) from image files.
>