Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2024/03/23 17:51:00 UTC

[jira] [Comment Edited] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

    [ https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830110#comment-17830110 ] 

Tilman Hausherr edited comment on TIKA-4171 at 3/23/24 5:50 PM:
----------------------------------------------------------------

We have a regression with the file [^876503.pdf] in the XFAExtractor class. What happens is that {{displayFieldName}} is now lost if {{fieldValues}} is empty. Because of that, the text "Enter the full name of the conveying party or parties" is missing for the field "conname1".

I'm not saying that this is wrong, I just wonder if this is intended.



> Tika server only returns last value for PDFs that have multiple of the same key
> -------------------------------------------------------------------------------
>
>                 Key: TIKA-4171
>                 URL: https://issues.apache.org/jira/browse/TIKA-4171
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-server
>            Reporter: Cassandra Xia
>            Priority: Major
>             Fix For: 3.0.0-BETA, 2.9.2
>
>         Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has multiple values for the same key name, e.g. in the screenshot below, lines 1-7 all have the same key name. When Tika Server parses this PDF, it only returns the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in some dictionary object, so the final value is the only survivor. Would it be possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> [embedded screenshot]
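
The reporter's hunch about a dictionary keeping only the last value can be illustrated with a minimal Java sketch. This is not Tika's actual code: the class name, method names, and the "owner" key are hypothetical and only show how a single-valued map drops earlier values for a repeated key, while a map of lists keeps them all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Tika.
public class MultiValueSketch {

    // Single-valued map: each put() on an existing key replaces the earlier value,
    // so only the last value for a repeated key survives.
    static Map<String, String> collectLastOnly(String[][] fields) {
        Map<String, String> out = new HashMap<>();
        for (String[] f : fields) {
            out.put(f[0], f[1]);
        }
        return out;
    }

    // Multi-valued map: values for a repeated key are appended in document order.
    static Map<String, List<String>> collectAll(String[][] fields) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String[] f : fields) {
            out.computeIfAbsent(f[0], k -> new ArrayList<>()).add(f[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three form rows that share one (assumed) key name.
        String[][] fields = {
            {"owner", "row 1"}, {"owner", "row 2"}, {"owner", "row 7"}
        };
        System.out.println(collectLastOnly(fields)); // {owner=row 7}
        System.out.println(collectAll(fields));      // {owner=[row 1, row 2, row 7]}
    }
}
```

Whether the real cause in Tika Server matches this pattern is an assumption; the sketch only shows why "last value wins" is the typical symptom of a single-valued map.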



--
This message was sent by Atlassian Jira
(v8.20.10#820010)