
Posted to user@tika.apache.org by David Pilato <da...@pilato.fr> on 2019/03/02 14:04:20 UTC

OCR Strategy ocr_only extracts also text

Hey team,


I'm wondering if I'm misunderstanding the purpose of ocr_only in the PDFParser.

I have a PDF that contains some text inside an image block and some regular text.


When I run Tika with a PDFParser configured with:

> PDFParser pdfParser = new PDFParser();
> pdfParser.setOcrStrategy("ocr_only");
> Parser PARSERS[] = new Parser[2];
> PARSERS[0] = new DefaultParser();
> PARSERS[1] = pdfParser;
> Parser parser = new AutoDetectParser(PARSERS);

Both texts are extracted from the PDF file.
I'd have expected that:


• no_ocr does not do any OCR (this is working fine: "This file contains some words." text is not extracted but "This file also contains text." is)
• ocr_and_text extracts both (this is working: "This file contains some words." and "This file also contains text." texts are extracted)
• ocr_only extracts only OCR-based text (this is not working: both "This file contains some words." and "This file also contains text." are extracted, where I'd expect only "This file contains some words.").

Is my understanding of the ocr_only value incorrect? This page (https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29) says:

> For ocrStrategy, we currently have: no_ocr (rely on regular text extraction only), ocr_only (don't bother extracting text, just run OCR on each page), ocr_and_text (both extract text and run OCR).
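Read literally, the wiki sentence maps each strategy to two independent switches: whether regular text extraction runs, and whether OCR runs. A minimal sketch of that reading follows. It is a toy model, not Tika code: the class, enum, and method names are made up for illustration, and it says nothing about what the OCR step itself will recognise on a page.

```java
import java.util.Locale;

// Toy model of the wiki's description of ocrStrategy. Not part of Tika.
class OcrStrategySketch {

    // Hypothetical enum; the constant names follow the wiki's string values.
    enum Strategy {
        NO_OCR(true, false),        // regular text extraction only
        OCR_ONLY(false, true),      // no text extraction, OCR on each page
        OCR_AND_TEXT(true, true);   // both

        final boolean extractsText;
        final boolean runsOcr;

        Strategy(boolean extractsText, boolean runsOcr) {
            this.extractsText = extractsText;
            this.runsOcr = runsOcr;
        }

        // Accepts the lower-case strings used in the thread, e.g. "ocr_only".
        static Strategy fromString(String value) {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"no_ocr", "ocr_only", "ocr_and_text"}) {
            Strategy s = Strategy.fromString(name);
            System.out.println(name + ": text extraction=" + s.extractsText
                    + ", OCR=" + s.runsOcr);
        }
    }
}
```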

Thanks!

Re: OCR Strategy ocr_only extracts also text

Posted by Tim Allison <ta...@apache.org>.

Sorry for my delay.  I'm not able to replicate this behavior. :(

When I parse this file:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testPDFVarious.pdf

This way:
        PDFParser pdfParser = new PDFParser();
        pdfParser.setOcrStrategy("ocr_only");
        ContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        try (InputStream is = getResourceAsStream("/test-documents/testPDFVarious.pdf")) {
            pdfParser.parse(is, handler, metadata, parseContext);
        }

Or better:
        AutoDetectParser parser = new AutoDetectParser();

        PDFParserConfig pdfParserConfig  = new PDFParserConfig();
        pdfParserConfig.setOcrStrategy("ocr_only");
        ParseContext parseContext = new ParseContext();
        parseContext.set(PDFParserConfig.class, pdfParserConfig);
        ContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();

        try (InputStream is = getResourceAsStream("/test-documents/testPDFVarious.pdf")) {
            parser.parse(is, handler, metadata, parseContext);
        }

I'm only seeing a <div class="ocr"/>...

When I run this with "ocr_and_text", I get the extracted text and the <div class="ocr">... too...
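To compare runs, it may help to separate the text inside the <div class="ocr"> elements from everything else in the XHTML. Below is a rough sketch that uses only the JDK's DOM parser. It assumes the input is the well-formed XHTML string produced by the handler (for example via ToXMLContentHandler.toString()) and that OCR output is always wrapped in a div with class "ocr". The class and method names are hypothetical, and the sample document in main() is hand-written, not real Tika output.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: splits body text into "inside an ocr div" and "everything else".
class OcrDivSplitter {

    static final class Split {
        final StringBuilder ocr = new StringBuilder();
        final StringBuilder other = new StringBuilder();
    }

    static Split split(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            Split result = new Split();
            // Only look at <body>, so <head> metadata is not counted as text.
            NodeList bodies = root.getElementsByTagNameNS("*", "body");
            for (int i = 0; i < bodies.getLength(); i++) {
                walk(bodies.item(i), false, result);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    private static void walk(Node node, boolean insideOcr, Split result) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                (insideOcr ? result.ocr : result.other).append(text).append('\n');
            }
            return;
        }
        if (node instanceof Element) {
            Element e = (Element) node;
            if ("div".equals(e.getLocalName()) && "ocr".equals(e.getAttribute("class"))) {
                insideOcr = true; // everything below this element counts as OCR text
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, insideOcr, result);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample for illustration only.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head><body>"
                + "<div class=\"page\"><p>This file also contains text.</p>"
                + "<div class=\"ocr\">This file contains some words.</div></div></body></html>";
        Split s = split(xhtml);
        System.out.println("OCR text:   " + s.ocr.toString().trim());
        System.out.println("Other text: " + s.other.toString().trim());
    }
}
```

If the "other" bucket is non-empty under ocr_only, regular text extraction is still running somewhere. If both sentences show up in the "ocr" bucket, they were produced by the OCR step itself.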

Help!

On Sat, Mar 9, 2019 at 7:44 AM David Pilato <da...@pilato.fr> wrote:

> So I tried with
>
> Parser parser = new AutoDetectParser(pdfParser);
>
> And with:
>
> Parser parser = pdfParser;
>
> I'm still seeing the same behavior.
> Does it look like an issue? Or something wrong on my side (well this is
> often the case :) ).
>
>
> On 7 March 2019 at 01:30 +0100, David Pilato <da...@pilato.fr> wrote:
>
> Sadly not yet. I've added this to my to-do list, but what you said makes
> sense to me.
>
> I'll check this later.
>
>
> Thanks for answering! 🤗
> On 6 March 2019 at 23:11 +0100, Tim Allison <ta...@apache.org> wrote:
>
> David,
>  Are you all set w this or are there still surprises?
>
> On Sat, Mar 2, 2019 at 3:04 PM Tim Allison <ta...@apache.org> wrote:
>
>> Hi David,
>>  I’m afk...take following w grain of salt. If you aren’t excluding the
>> PDFParser from your DefaultParser, there’s a chance that one is being
>> called rather than the one you’re adding.
>>   Try creating a PDFParserConfig, setting it as you want, add it to the
>> ParseContext that you send into the parse() on the regular DefaultParser.
>>   If you’re still finding surprises, please let us know.
>>
>>     Best,
>>
>>       Tim
>>

Re: OCR Strategy ocr_only extracts also text

Posted by David Pilato <da...@pilato.fr>.

So I tried with

Parser parser = new AutoDetectParser(pdfParser);

And with:

Parser parser = pdfParser;

I'm still seeing the same behavior.
Does it look like an issue? Or something wrong on my side (well this is often the case :) ).



Re: OCR Strategy ocr_only extracts also text

Posted by David Pilato <da...@pilato.fr>.

Sadly not yet. I've added this to my to-do list, but what you said makes sense to me.

I'll check this later.


Thanks for answering! 🤗

Re: OCR Strategy ocr_only extracts also text

Posted by Tim Allison <ta...@apache.org>.

David,
 Are you all set w this or are there still surprises?


Re: OCR Strategy ocr_only extracts also text

Posted by Tim Allison <ta...@apache.org>.

Hi David,
 I'm AFK, so take the following with a grain of salt. If you aren't
excluding the PDFParser from your DefaultParser, there's a chance that the
DefaultParser's own PDFParser is being called rather than the one you're
adding.
  Try creating a PDFParserConfig, setting it as you want, and adding it to
the ParseContext that you send into parse() on the regular DefaultParser.
  If you're still finding surprises, please let us know.

    Best,

      Tim
