Posted to dev@tika.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2022/04/29 14:54:28 UTC

What may have changed in ODT parser in Tika 2

Hi Tim, All

I have a simple test that reads string content from an ODT document, and it
is now failing. PDF and Excel are fine, but something is going on with the
ODT parsing.

quarkus.odt in
https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/main/resources/
is expected to return a "Hello Quarkus" string

but now the test fails with

Expected: is "Hello Quarkus"
  Actual: Thumbnails/thumbnail.png.

AutoDetectParser is used to parse, using a standard sequence

https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L85

Maybe it is an auto-detection issue; the media type that is used is here:

https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/test/java/io/quarkus/it/tika/TikaParserTest.java#L25

Any hints will be appreciated

Thanks, Sergey
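
For context on the "Thumbnails/thumbnail.png" value in the failure above: an ODT file is a ZIP package, and that string looks like the name of one of its entries. The sketch below does not use Tika and does not read the real quarkus.odt. It builds a tiny in-memory ZIP whose entry names mimic an ODT package and lists them. The class name, helper names, and entry contents are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OdtEntryListing {

    /** Builds a small ZIP in memory whose entry names mimic an ODT package. */
    static byte[] buildSamplePackage() {
        // Placeholder contents only: this is not valid ODF XML or a real image.
        String[][] entries = {
            {"mimetype", "application/vnd.oasis.opendocument.text"},
            {"content.xml", "<text>Hello Quarkus</text>"},
            {"Thumbnails/thumbnail.png", "not a real image"}
        };
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String[] entry : entries) {
                zip.putNextEntry(new ZipEntry(entry[0]));
                zip.write(entry[1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns the entry names of a ZIP archive in the order they are stored. */
    static List<String> listEntryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Lists the entry names of the sample package.
        System.out.println(listEntryNames(buildSamplePackage()));
    }
}
```

Listing the entries of the real file the same way (reading it from disk instead of building it in memory) would show whether it actually carries a thumbnail entry with that name.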

Re: What may have changed in ODT parser in Tika 2

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim

Thanks for this fix. I can confirm that everything works now with 2.4.1.

Cheers, Sergey

On Mon, May 2, 2022 at 6:43 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi Tim
>
> I gave it another try and it looks like only the thumbnail file name is
> reported, `ToTextContentHandler` is used by default
>
> I can try again with 2.4.1 RC later
>
> Thanks, Sergey
>
>
> On Sat, Apr 30, 2022 at 2:08 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
>
>> Hi Tim
>>
>> Thanks for a quick fix, missed your answer yesterday,  will check soon
>> and let you know.
>>
>> Cheers Sergey
>>
>>
>> On Fri 29 Apr 2022, 16:49 Tim Allison, <ta...@apache.org> wrote:
>>
>>> Hi Sergey,
>>>   That the thumbnail file name showed up in the stream is a bug I
>>> introduced in 2.3.x.  I missed it in the fix in 2.4.0 (TIKA-3711), but
>>> I just fixed it now (TIKA-3745).
>>>   Are you not seeing "Hello Quarkus" at all, or is it just not the
>>> only text -- contains vs equals?  I am seeing "Hello Quarkus" in at
>>> least the 2.4.0-rc1.
>>>
>>> On Fri, Apr 29, 2022 at 10:54 AM Sergey Beryozkin <sb...@gmail.com>
>>> wrote:
>>> >
>>> > Hi Tim, All
>>> >
>>> > I have a simple test reading a string content from an ODT doc failing, PDF,
>>> > Excel are good, but something is going on with the ODT parsing.
>>> >
>>> > quarkus.odt in
>>> > https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/main/resources/
>>> > is expected to return a "Hello Quarkus" string
>>> >
>>> > but now the test fails with
>>> >
>>> > Expected: is "Hello Quarkus"
>>> >   Actual: Thumbnails/thumbnail.png.
>>> >
>>> > AutoDetectParser is used to parse, using a standard sequence
>>> >
>>> > https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L85
>>> >
>>> > May be it is an auto-detection issue, the media type which is used is here:
>>> >
>>> > https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/test/java/io/quarkus/it/tika/TikaParserTest.java#L25
>>> >
>>> > Any hints will be appreciated
>>> >
>>> > Thanks, Sergey
>>>
>>

Re: What may have changed in ODT parser in Tika 2

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim

I gave it another try, and it looks like only the thumbnail file name is
reported. `ToTextContentHandler` is used by default.

I can try again with 2.4.1 RC later

Thanks, Sergey



Re: What may have changed in ODT parser in Tika 2

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim

Thanks for the quick fix. I missed your answer yesterday; I will check soon
and let you know.

Cheers, Sergey



Re: What may have changed in ODT parser in Tika 2

Posted by Tim Allison <ta...@apache.org>.
Hi Sergey,
  That the thumbnail file name showed up in the stream is a bug I
introduced in 2.3.x.  I missed it in the fix in 2.4.0 (TIKA-3711), but
I just fixed it now (TIKA-3745).
  Are you not seeing "Hello Quarkus" at all, or is it just not the
only text -- contains vs equals?  I am seeing "Hello Quarkus" in at
least the 2.4.0-rc1.
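
The "contains vs equals" distinction above can be sketched as follows. This is a minimal illustration, not code from the quarkus-tika test: the class and method names are hypothetical, and the extracted string in main is made up to show the case where the stray entry name appears alongside the expected text. It is not actual Tika output.

```java
public class ExtractedTextCheck {

    /** Collapses whitespace runs to single spaces and trims both ends. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Strict check: the normalized extracted text is exactly the expected string. */
    static boolean equalsExpected(String extracted, String expected) {
        return normalize(extracted).equals(expected);
    }

    /** Lenient check: the expected string appears somewhere in the extracted text. */
    static boolean containsExpected(String extracted, String expected) {
        return normalize(extracted).contains(expected);
    }

    public static void main(String[] args) {
        // Hypothetical extractor output, made up for illustration only.
        String extracted = "Thumbnails/thumbnail.png\nHello Quarkus\n";
        System.out.println("equals:   " + equalsExpected(extracted, "Hello Quarkus"));
        System.out.println("contains: " + containsExpected(extracted, "Hello Quarkus"));
    }
}
```

With extra text in the stream, a strict equality assertion fails while a contains assertion still passes. If the expected text is missing entirely, both fail, which is the case worth distinguishing when reporting the problem.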
