
Posted to user@tika.apache.org by David Pilato <da...@pilato.fr> on 2018/12/18 20:42:51 UTC

OCR and Raw text

Heya


When OCR is available, what should happen when I have a document containing both text and images with text?

For example, I have a PDF with the text "hello world" and an image containing "foo bar".
When I run Tika with Tesseract to extract text, only the text part, "hello world", is extracted.

If I run the same configuration on a PDF which contains only an image with "foo bar" then "foo bar" is extracted.

Is that expected?
If so, does this mean that as soon as some text is extracted from a document we don't run OCR at all?

Thanks for your insights.


David

--
David Pilato, elastic.co
Developer | Evangelist,

Re: OCR and Raw text

Posted by David Pilato <da...@pilato.fr>.

Old thread, but I never replied to this one, so let's give it some closure here :)


The latest tests I ran actually work as expected. A mistake in my code was causing a misconfiguration of the parsers.

Thanks Tim!


--
David Pilato, elastic.co
Developer | Evangelist,
On 21 Dec 2018 at 16:38 +0100, Tim Allison <ta...@apache.org> wrote:
> Hi David,
> I'm sorry for my slow response!
>
> That behavior isn't expected. How have you configured Tika to run
> OCR on PDFs?
> 1) extractInlineImages
> 2) render the page and then run OCR
>     a) no_ocr
>     b) ocr_only
>     c) ocr_and_text
>
> Is there any chance that "foo bar" is in the title of the PDF for the
> image-only PDF? We do write title info into the body.

Re: OCR and Raw text

Posted by Tim Allison <ta...@apache.org>.

Hi David,
  I'm sorry for my slow response!

  That behavior isn't expected.  How have you configured Tika to run
OCR on PDFs?
1) extractInlineImages
2) render the page and then run OCR
    a) no_ocr
    b) ocr_only
    c) ocr_and_text

Is there any chance that "foo bar" is in the title of the PDF for the
image-only PDF?  We do write title info into the body.
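
To make the options listed above concrete, here is a toy model in Java of which text each one would be expected to return for a page with a text layer ("hello world") and one image ("foo bar"). It is only a sketch under stated assumptions: it does not call Tika or Tesseract, the class and method names are hypothetical, OCR is assumed to read everything visible perfectly, and the behavior per strategy is an assumption drawn from the option names rather than from Tika's source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: no Tika or Tesseract calls. All names here are hypothetical.
final class OcrModeSketch {

    // Mirrors the three strategy values named in the message above.
    enum Strategy { NO_OCR, OCR_ONLY, OCR_AND_TEXT }

    // Assumption: OCR on a rendered page reads everything visible,
    // i.e. the text layer plus the text inside each image.
    static String renderedPageText(String textLayer, List<String> imageTexts) {
        List<String> visible = new ArrayList<>();
        if (!textLayer.isEmpty()) {
            visible.add(textLayer);
        }
        visible.addAll(imageTexts);
        return String.join(" ", visible);
    }

    // Option 1: keep the text layer and OCR each inline image on its own.
    static List<String> withInlineImages(String textLayer, List<String> imageTexts) {
        List<String> out = new ArrayList<>();
        if (!textLayer.isEmpty()) {
            out.add(textLayer);
        }
        out.addAll(imageTexts);
        return out;
    }

    // Option 2: render the page to an image and apply the chosen strategy.
    static List<String> withRenderedPage(Strategy strategy, String textLayer,
                                         List<String> imageTexts) {
        List<String> out = new ArrayList<>();
        // Regular text extraction runs unless the strategy is OCR only.
        if (strategy != Strategy.OCR_ONLY && !textLayer.isEmpty()) {
            out.add(textLayer);
        }
        // OCR of the rendered page runs unless the strategy disables OCR.
        if (strategy != Strategy.NO_OCR) {
            String ocr = renderedPageText(textLayer, imageTexts);
            if (!ocr.isEmpty()) {
                out.add(ocr);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> images = Collections.singletonList("foo bar");
        System.out.println("inline images: " + withInlineImages("hello world", images));
        for (Strategy s : Strategy.values()) {
            System.out.println(s + ": " + withRenderedPage(s, "hello world", images));
        }
    }
}
```

In this model, only the no_ocr case returns "hello world" alone for the mixed PDF, and ocr_and_text returns the text layer twice (once from extraction, once from OCR). In Tika itself these settings are, I believe, exposed on PDFParserConfig (setExtractInlineImages and setOcrStrategy), so that is the first place to check when the image text is missing.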





On Fri, Dec 21, 2018 at 8:04 AM David Pilato <da...@pilato.fr> wrote:
>
> Does anyone know?
> If not, I guess I'll need to look at the code or turn on debug logging. :)
>
>
>
> David
>
> --
> David Pilato, elastic.co
> Developer | Evangelist,

Re: OCR and Raw text

Posted by David Pilato <da...@pilato.fr>.

Does anyone know?
If not, I guess I'll need to look at the code or turn on debug logging. :)



David

--
David Pilato, elastic.co
Developer | Evangelist,