
Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/10/04 21:03:00 UTC

[jira] [Commented] (TIKA-3560) Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1

    [ https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424178#comment-17424178 ] 

Tim Allison commented on TIKA-3560:
-----------------------------------

I updated the metadata section of our wiki page "Migrating to Tika 2.x" today.  I looked into subject: in 1.x we were putting "keywords" into subject as well as into keywords, and we've kept that behavior in 2.x.  I'm not sure why there's an array in 2.x but not in 1.x.  Those should be the same.

In 2.1.1-SNAPSHOT, I added empty checks for subject, keywords, title and other keys in the MSOffice parsers.  Previously, those parsers allowed an empty string for string-based metadata values.
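
For illustration only, the kind of empty check described above could be sketched roughly as follows.  This is not the code from the MSOffice parsers: a plain Map stands in for the metadata container, the helper name addIfNotBlank is hypothetical, and the keys "title", "subject" and "keywords" are placeholders rather than the real metadata keys.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a plain map stands in for a metadata container.
public class EmptyCheckSketch {

    // Hypothetical helper: record the value only if it is non-null and
    // contains at least one non-whitespace character.
    public static void addIfNotBlank(Map<String, List<String>> metadata,
                                     String key, String value) {
        if (value == null || value.trim().isEmpty()) {
            return; // skip empty values instead of storing ""
        }
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        addIfNotBlank(metadata, "title", "Quarterly report"); // kept
        addIfNotBlank(metadata, "subject", "");               // skipped
        addIfNotBlank(metadata, "keywords", "   ");           // skipped
        System.out.println(metadata);
    }
}
```

With a check like this, a key whose only value is empty does not appear in the output at all, rather than appearing with an empty string.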

> Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
> ------------------------------------------------------------------
>
>                 Key: TIKA-3560
>                 URL: https://issues.apache.org/jira/browse/TIKA-3560
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.0.0, 2.1.0
>         Environment: Windows 10
>            Reporter: Josh Burchard
>            Priority: Major
>         Attachments: Capture.jpg
>
>
> I'm parsing an old .doc file and I'm sending my request to the /rmeta/text endpoint. I see that some metadata fields that were returned to me from Tika 1.24.1 are no longer returned in 2.0 and above.
> I diffed the output between 1.24.1, 2.0 and 2.1.  Please see the screenshot I've attached. 
>  


