Posted to dev@tika.apache.org by "frank (JIRA)" <ji...@apache.org> on 2013/12/24 08:48:53 UTC
[jira] [Commented] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856214#comment-13856214 ]
frank commented on TIKA-93:
---------------------------
This feature is really useful and helpful.
> OCR support
> -----------
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
Re: [jira] [Commented] (TIKA-93) OCR support
Posted by Oleg Tikhonov <ol...@apache.org>.
Hi Frank,
It's not so easy, especially given the dependency on native libraries.
It also depends on "trained" profiles, languages & fonts.
The questions are: which platforms do we want to support, and which
languages and fonts?
BR,
Oleg
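
To make the idea in the issue description concrete, here is a minimal sketch of invoking the Tesseract command line tool from Java. It is only an illustration under stated assumptions, not Tika's implementation: it assumes a `tesseract` binary is installed and on the PATH, that it accepts the form `tesseract <image> <outputbase> -l <lang>` and writes its result to `<outputbase>.txt`, and that the trained data for the requested language is present. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TesseractCliSketch {

    // Builds the argument list: tesseract <image> <outputBase> [-l <language>].
    // Assumption: the binary is called "tesseract" and is on the PATH.
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image.toString());
        cmd.add(outputBase.toString());
        if (language != null && !language.isEmpty()) {
            cmd.add("-l");
            cmd.add(language);
        }
        return cmd;
    }

    // Runs the external tool and returns the recognized text.
    // Assumption: Tesseract writes its output to <outputBase>.txt.
    static String extractText(Path image, String language)
            throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = outputBase.resolveSibling(outputBase.getFileName() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .inheritIO() // let the tool's messages go to our console
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up both the placeholder file and the generated text file.
            Files.deleteIfExists(outputBase);
            Files.deleteIfExists(textFile);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TesseractCliSketch <image> [language]");
            return;
        }
        String language = args.length > 1 ? args[1] : "eng";
        System.out.println(extractText(java.nio.file.Paths.get(args[0]), language));
    }
}
```

The language argument and the reliance on an installed native binary are exactly the platform and training-data dependencies raised above, so a real integration would need to decide how to detect and configure them.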