Posted to dev@tika.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2022/06/18 16:37:56 UTC

Re: What may have changed in ODT parser in Tika 2

Hi Tim

Thanks for this fix. I've confirmed that everything works now with 2.4.1.

Cheers, Sergey

On Mon, May 2, 2022 at 6:43 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi Tim
>
> I gave it another try, and it looks like only the thumbnail file name is
> reported. `ToTextContentHandler` is used by default.
>
> I can try again with 2.4.1 RC later
>
> Thanks, Sergey
>
>
> On Sat, Apr 30, 2022 at 2:08 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
>
>> Hi Tim
>>
>> Thanks for the quick fix. I missed your answer yesterday; I will check soon
>> and let you know.
>>
>> Cheers Sergey
>>
>>
>> On Fri 29 Apr 2022, 16:49 Tim Allison, <ta...@apache.org> wrote:
>>
>>> Hi Sergey,
>>>   That the thumbnail file name showed up in the stream is a bug I
>>> introduced in 2.3.x.  I missed it in the fix in 2.4.0 (TIKA-3711), but
>>> I just fixed it now (TIKA-3745).
>>>   Are you not seeing "Hello Quarkus" at all, or is it just not the
>>> only text -- contains vs equals?  I am seeing "Hello Quarkus" in at
>>> least the 2.4.0-rc1.
>>>
>>> On Fri, Apr 29, 2022 at 10:54 AM Sergey Beryozkin <sb...@gmail.com>
>>> wrote:
>>> >
>>> > Hi Tim, All
>>> >
>>> > I have a simple test that reads string content from an ODT doc, and it is
>>> > failing. PDF and Excel are fine, but something is going on with the ODT
>>> > parsing.
>>> >
>>> > quarkus.odt in
>>> >
>>> https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/main/resources/
>>> > is expected to return a "Hello Quarkus" string
>>> >
>>> > but now the test fails with
>>> >
>>> > Expected: is "Hello Quarkus"
>>> >   Actual: Thumbnails/thumbnail.png.
>>> >
>>> > AutoDetectParser is used to parse, using a standard sequence
>>> >
>>> >
>>> https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L85
>>> >
>>> > Maybe it is an auto-detection issue; the media type that is used is here:
>>> >
>>> >
>>> https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/test/java/io/quarkus/it/tika/TikaParserTest.java#L25
>>> >
>>> > Any hints will be appreciated
>>> >
>>> > Thanks, Sergey
>>>
>>
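
For readers following the thread, the "contains vs equals" distinction Tim asks about can be sketched with plain Java string checks. This is only an illustration, not the quarkus-tika test itself: the class name, the helper methods, and the sample strings other than "Hello Quarkus" and the thumbnail path are assumptions made up for the example, and no Tika API is involved.

```java
public class ExtractedTextCheck {

    // Strict check: the extracted text, ignoring surrounding whitespace,
    // must be exactly the expected string.
    static boolean equalsExpected(String extracted, String expected) {
        return extracted.trim().equals(expected);
    }

    // Lenient check: the expected string only has to appear somewhere
    // in the extracted text.
    static boolean containsExpected(String extracted, String expected) {
        return extracted.contains(expected);
    }

    public static void main(String[] args) {
        String expected = "Hello Quarkus";

        // Hypothetical extraction results, for illustration only.
        String thumbnailOnly = "Thumbnails/thumbnail.png";
        String thumbnailPlusBody = "Thumbnails/thumbnail.png\nHello Quarkus\n";
        String bodyOnly = "Hello Quarkus\n";

        for (String text : new String[] {thumbnailOnly, thumbnailPlusBody, bodyOnly}) {
            System.out.println("equals=" + equalsExpected(text, expected)
                    + " contains=" + containsExpected(text, expected));
        }
    }
}
```

A strict equality assertion fails whenever any extra text, such as an embedded thumbnail's file name, leaks into the output, while a contains-style assertion only fails when the body text is missing altogether. That difference is what separates the two cases discussed above.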