Posted to user@tika.apache.org by Tilman Hausherr <TH...@t-online.de> on 2021/09/01 18:37:00 UTC

Re: Form fields and other issues with PDF files

On 30.08.2021 at 21:26, Peter Kronenberg wrote:
>
> Hmm, you’re right.  I tried it on another form (the downloadable 1040
> from irs.gov) and it does list the values of the form fields (of
> course, you’d have to do the mapping yourself, so you know that field
> f1_02 is the first name).
>
> But it didn’t work on the sample file I had, which, unfortunately, I
> can’t share.
>
> It’s definitely not a scanned file.  What are the requirements for
> this to happen?  Is there a way to convert a PDF to XFA?
>
Not in PDFBox nor in Tika. XFA is a deprecated format that isn't really 
part of PDF.

Tilman



> <div class="xfa_form"><ol>
>     <li fieldName="c1_01">c1_01: 0</li>
>     <li fieldName="f1_01">f1_01: </li>
>     <li fieldName="f1_02">f1_02: John</li>
>     <li fieldName="f1_03">f1_03: Smith</li>
>     <li fieldName="f1_04">f1_04: </li>
>     <li fieldName="f1_05">f1_05: </li>
>     <li fieldName="f1_06">f1_06: </li>
>     <li fieldName="f1_07">f1_07: </li>
>     <li fieldName="f1_08">f1_08: 123 Main St</li>
>     <li fieldName="f1_09">f1_09: </li>
>     <li fieldName="f1_10">f1_10: </li>
>
> Peter Kronenberg | Senior AI Analytic Engineer
> C: 703.887.5623
> Torch AI <http://www.torch.ai/>
> 4303 W. 119th St., Leawood, KS 66209
>
> From: Tim Allison <ta...@apache.org>
> Sent: Monday, August 30, 2021 3:01 PM
> To: user@tika.apache.org
> Subject: Re: Form fields and other issues with PDF files
>
> Sorry for not responding sooner.
>
> > When extracting text from PDF files (no OCR), there doesn’t seem to
> > be any way to link the text that was filled in with the name of the
> > form field.  For example, if there is a field marked ‘First Name’ and
> > the user fills that in, they likely appear on different lines and
> > different places, with no way to associate them.  Is there any way to
> > do this?
>
> Can you share an example file?  I thought we were marking field names 
> and contents with <div> elements for AcroForms.  If you're processing 
> XFA, I'm pretty sure we try to associate form keys and values.
>
> If the form is not well put together or if it is a scan of a form, 
> there's not much we can do.
>
> On Mon, Aug 30, 2021 at 2:40 PM Peter Kronenberg
> <peter.kronenberg@torch.ai> wrote:
>
>     Is this capability of associating form fields with their data
>     something that PDFBox doesn’t even support?  Just want to
>     understand whether this is a limitation of Tika alone or whether
>     PDFBox has no way to do it either.
>
>     From: Tilman Hausherr <THausherr@t-online.de>
>     Sent: Monday, August 30, 2021 2:34 PM
>     To: user@tika.apache.org
>     Subject: Re: Form fields and other issues with PDF files
>
>     On 30.08.2021 at 18:47, Peter Kronenberg wrote:
>
>         Is there a way to extract just the form field data?  That way,
>         if it was a known form, it might be easier to match up the
>         responses with the fields they belong to.
>
>     I looked at PDFParserConfig and didn't find such an option. Even
>     if there were, I doubt it would help with matching.
>
>     Tilman
>
>         I’ll take a look at Tabula for the tables
>
>         From: Tilman Hausherr <TH...@t-online.de>
>         Sent: Saturday, August 28, 2021 5:38 AM
>         To: user@tika.apache.org
>         Subject: Re: Form fields and other issues with PDF files
>
>         Field texts: there is no formal way to do this in the PDF
>         specification.
>
>         Tables: try Tabula; it uses heuristics.
>
>         Strike-out text: the text is drawn with a font, while the
>         strike-out line is a vector graphic (or an annotation), so the
>         two are not connected. One would have to write an algorithm
>         to associate them.
>
>         Tilman
>
>         On 27.08.2021 at 16:39, Peter Kronenberg wrote:
>
>              1. When extracting text from PDF files (no OCR), there
>                 doesn’t seem to be any way to link the text that was
>                 filled in with the name of the form field.   For
>                 example, if there is a field marked ‘First Name’ and
>                 the user fills that in, they likely appear on
>                 different lines and different places, with no way to
>                 associate them.  Is there any way to do this?
>
>              2. It’s also sometimes difficult to figure out how tables
>                 are extracted.  If I have a 2 column table, it seems
>                 to ignore the tabular format and just extract text
>                 line by line.  In this example (ignoring the
>                 hand-written text), it gets extracted as
>                 ‘Comprehensive General Liability (including, if $2.0
>                 million’
>
>              3. Deleted, or strike-out text, is extracted with no
>                 indication
>

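On the strike-out point in the quoted reply above (the text and the line through it are separate objects, so an algorithm has to connect them), here is a minimal sketch of one possible heuristic. It is not something PDFBox or Tika provides. It assumes you have already obtained text bounding boxes and drawn line segments in the same page coordinates from some PDF library; the `TextBox` and `Segment` types, the function names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """Hypothetical text run with its bounding box (y0 = bottom, y1 = top)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Segment:
    """Hypothetical straight line drawn on the page."""
    x0: float
    y0: float
    x1: float
    y1: float


def is_struck_out(box, seg, min_cover=0.8, slope_tol=1.0):
    """Return True if seg looks like a strike-through of box.

    Heuristic: the segment is nearly horizontal, runs through the
    middle half of the box's height (so underlines are ignored), and
    covers at least min_cover of the box's width.
    """
    if abs(seg.y1 - seg.y0) > slope_tol:
        return False
    y = (seg.y0 + seg.y1) / 2
    height = box.y1 - box.y0
    if not (box.y0 + 0.25 * height <= y <= box.y0 + 0.75 * height):
        return False
    left, right = sorted((seg.x0, seg.x1))
    overlap = min(right, box.x1) - max(left, box.x0)
    width = box.x1 - box.x0
    return width > 0 and overlap / width >= min_cover


def mark_strikeouts(boxes, segments):
    """Pair each text run with a flag saying whether any segment strikes it."""
    return [(b.text, any(is_struck_out(b, s) for s in segments)) for b in boxes]
```

A real document would need more care (rotated text, strike-out annotations instead of drawn lines, thin filled rectangles used as lines), so treat the thresholds as starting points only.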

Re: Form fields and other issues with PDF files

Posted by Peter Kronenberg <pe...@torch.ai>.
Thank you. I'll take a look


________________________________
From: Joe Wicentowski <jo...@gmail.com>
Sent: Saturday, September 4, 2021 1:24:46 AM
To: user@tika.apache.org
Cc: tallison@apache.org
Subject: Re: Form fields and other issues with PDF files


This isn't a Tika-specific suggestion, but I wrote a blog post about PDF forms a few years ago, and the resources/processes described there may be useful:

Filling PDF forms with PDFtk, XFDF, and XQuery
https://joewiz.org/2014/02/13/filling-pdf-forms-with-pdftk-xfdf-and-xquery/

On Fri, Sep 3, 2021 at 4:08 PM Peter Kronenberg <pe...@torch.ai> wrote:

I know nothing about building PDFs, AcroForms, XFA, or any of that stuff 😊.  Is there an easy guide to building PDF forms that satisfy these requirements, which I can pass on to my client?



From: Tim Allison <ta...@apache.org>
Sent: Thursday, September 2, 2021 4:32 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Form fields and other issues with PDF files



I'd trust what Tilman has to say.

It is hard to associate AcroForm data with locations on pages because the form is a separate entity in the PDF.

From the one test file we have in our unit tests, I see the following.  This is not XFA.  This suggests that actual AcroForms, no matter how their fields are spread across the pages, _should_ at least get reasonable markup.  Obviously, if someone has just laid out a form visually without an AcroForm, YMMV...



<div class="acroform"><ol> <li>aTextField: TIKA-1226</li>
<li>aCheckBox: Yes</li>
<li>aComboBox: [comboExportB]</li>
<li>aListBox: [exportListItemC]</li>
</ol>
</div>



for this PDF

[inline image not preserved in this archive]
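
As a rough sketch (not part of Tika), the list markup shown above can be turned into a name/value mapping with Python's standard `html.parser`. The class and function names are made up for illustration, and the input is assumed to look like the `acroform` and `xfa_form` snippets quoted in this thread.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects 'name: value' list items found inside a form <div>."""

    FORM_CLASSES = {"acroform", "xfa_form"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._div_depth = 0   # > 0 while inside a form <div>
        self._in_item = False
        self._name = None     # value of the fieldName attribute, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._div_depth or attrs.get("class") in self.FORM_CLASSES:
                self._div_depth += 1
        elif tag == "li" and self._div_depth:
            self.finish_item()  # tolerate an unclosed previous <li>
            self._in_item = True
            # HTMLParser lowercases attribute names: fieldName -> fieldname
            self._name = attrs.get("fieldname")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "li":
            self.finish_item()
        elif tag == "div" and self._div_depth:
            self.finish_item()
            self._div_depth -= 1

    def handle_data(self, data):
        if self._in_item:
            self._text.append(data)

    def finish_item(self):
        """Store the current list item, if one is open."""
        if not self._in_item:
            return
        self._in_item = False
        text = "".join(self._text).strip()
        if self._name and text.startswith(self._name + ":"):
            name, value = self._name, text[len(self._name) + 1:]
        else:
            name, _, value = text.partition(":")
        if name.strip():
            self.fields[name.strip()] = value.strip()


def extract_form_fields(xhtml):
    """Return {field name: value} for form list items in Tika-style XHTML."""
    parser = FormFieldParser()
    parser.feed(xhtml)
    parser.close()
    parser.finish_item()  # handles input that was cut off mid-item
    return parser.fields
```

For an XFA form like the 1040, the keys would be internal names such as `f1_02`, so a hand-written lookup table from those names to labels ("first name" and so on) would still be needed on top of this.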



On Thu, Sep 2, 2021 at 11:33 AM Peter Kronenberg <pe...@torch.ai> wrote:

So XFA is deprecated, yet that's the only thing that makes this work?  Does that mean this is less likely to work with newer formats?

@tallison@apache.org, is there another way Tika can do this for non-XFA forms?






Re: Form fields and other issues with PDF files

Posted by Joe Wicentowski <jo...@gmail.com>.
This isn't a Tika-specific suggestion, but I wrote a blog post about PDF
forms a few years ago, and the resources/processes described there may be
useful:

Filling PDF forms with PDFtk, XFDF, and XQuery
https://joewiz.org/2014/02/13/filling-pdf-forms-with-pdftk-xfdf-and-xquery/
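
For readers who have not seen XFDF before, here is a minimal sketch of what generating such a file could look like in Python. It is an illustration only and is not taken from the blog post: the function name is hypothetical, the field names are borrowed from the 1040 output quoted elsewhere in this thread, and it assumes flat (non-hierarchical) field names.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(values):
    """Serialize {field name: value} as a minimal XFDF document string."""
    # Emit the XFDF namespace as the default namespace (no prefix).
    ET.register_namespace("", XFDF_NS)
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    fields = ET.SubElement(root, f"{{{XFDF_NS}}}fields")
    for name, value in values.items():
        field = ET.SubElement(fields, f"{{{XFDF_NS}}}field", {"name": name})
        ET.SubElement(field, f"{{{XFDF_NS}}}value").text = value
    return ET.tostring(root, encoding="unicode")


# Example data; the names must match the field names in the target form.
xfdf_text = build_xfdf({"f1_02": "John", "f1_03": "Smith"})
```

The resulting text could then be saved to a file and passed to a form-filling tool such as PDFtk's `fill_form` operation; check the tool's own documentation for the exact invocation.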

On Fri, Sep 3, 2021 at 4:08 PM Peter Kronenberg <pe...@torch.ai>
wrote:

> I know nothing about building PDFs, AcroForms, XFA, or any of that
> stuff 😊.  Is there an easy guide to building PDF forms that satisfy
> these requirements, which I can pass on to my client?

RE: Form fields and other issues with PDF files

Posted by Peter Kronenberg <pe...@torch.ai>.
I know nothing about building PDFs, AcroForms, XFA, or any of that stuff 😊.  Is there an easy guide to building PDF forms that satisfy these requirements, which I can pass on to my client?



From: Tim Allison <ta...@apache.org>
Sent: Thursday, September 2, 2021 4:32 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Form fields and other issues with PDF files


I'd trust what Tilman has to say.

It is hard to associate acroform data with locations on pages because it is a separate entity in the PDF.

From the one test file we have in our unit tests, I see the following.  This is not xfa.  This suggests that the forms at least, no matter how they are spread on the pages _should_ have reasonable markup for actual acroforms.  Obv, if someone has just visually done a form without an acroform, ymmv...

<div class="acroform"><ol> <li>aTextField: TIKA-1226</li>
<li>aCheckBox: Yes</li>
<li>aComboBox: [comboExportB]</li>
<li>aListBox: [exportListItemC]</li>
</ol>
</div>

for this PDF
[cid:image002.png@01D7A0DD.F04577A0]

On Thu, Sep 2, 2021 at 11:33 AM Peter Kronenberg <pe...@torch.ai>> wrote:
So XFA is deprecated, yet that's the only thing  that makes this work?  Does that mean this is less likely to work with newer formats
@tallison@apache.org<ma...@apache.org>  is there another way Tika can do this for a non-XFA forms ?


________________________________
From: Tilman Hausherr <TH...@t-online.de>>
Sent: Wednesday, September 1, 2021 2:37:00 PM
To: user@tika.apache.org<ma...@tika.apache.org> <us...@tika.apache.org>>
Subject: Re: Form fields and other issues with PDF files

Am 30.08.2021 um 21:26 schrieb Peter Kronenberg:

Hmm, you’re right.  I tried it on another form (The downloadable 1040 from irs.gov<https://us-east-2.protection.sophos.com?d=irs.gov&u=aHR0cDovL2lycy5nb3Y=&i=NjAwMDY2MjNjNzQ1NDY0ODkyYTNmNTg3&t=bUhYSG5uOW9FZno2bFFFb0E1NkR4ZkdHZFBUYnRYeHA1NmR3Z1BEMXJmRT0=&h=a7eb75c282c14594b381c38e873a180d>) and it does list the values of the form fields (of course, you’d have to do the mapping yourself, so you know that field f1_02 is first name)

But it didn’t work on the sample file I had, which unfortunately, I can’t share.

It’s definitely not a scanned file.  What are the requirements for allowing this to happened?  Is there a way to convert a PDF to XFA?

Not in PDFBox nor in Tika. XFA is a deprecated format that isn't really part of PDF.

Tilman









<div class="xfa_form"><ol>    <li fieldName="c1_01">c1_01: 0</li>

            <li fieldName="f1_01">f1_01: </li>

            <li fieldName="f1_02">f1_02: John</li>

            <li fieldName="f1_03">f1_03: Smith</li>

            <li fieldName="f1_04">f1_04: </li>

            <li fieldName="f1_05">f1_05: </li>

            <li fieldName="f1_06">f1_06: </li>

            <li fieldName="f1_07">f1_07: </li>

            <li fieldName="f1_08">f1_08: 123 Main St</li>

            <li fieldName="f1_09">f1_09: </li>

            <li fieldName="f1_10">f1_10: </li>



Peter Kronenberg  |  Senior AI Analytic ENGINEER

C: 703.887.5623

[Torch AI]<https://us-east-2.protection.sophos.com/?d=torch.ai&u=aHR0cDovL3d3dy50b3JjaC5haS8=&i=NjAwMDY2MjNjNzQ1NDY0ODkyYTNmNTg3&t=dHRDUUJralFuRnRCU2tvcmRLNUUycFdBV2RmazdTZU0zZUZVM21GSXhobz0=&h=caf2ab49be2c484bb9beb236ff7aab54>

4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<https://us-east-2.protection.sophos.com?d=torch.ai&u=aHR0cDovL3d3dy50b3JjaC5haS8=&i=NjAwMDY2MjNjNzQ1NDY0ODkyYTNmNTg3&t=dHRDUUJralFuRnRCU2tvcmRLNUUycFdBV2RmazdTZU0zZUZVM21GSXhobz0=&h=caf2ab49be2c484bb9beb236ff7aab54>





From: Tim Allison <ta...@apache.org>
Sent: Monday, August 30, 2021 3:01 PM
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: Re: Form fields and other issues with PDF files



Sorry for not responding sooner.



>When extracting text from PDF files (no OCR), there doesn’t seem to be any way to link the text that was filled in with the name of the form field.   For example, if there is a field marked ‘First Name’ and the user fills that in, they likely appear on different lines and different places, with no way to associate them.  Is there any way to do this?



Can you share an example file?  I thought we were marking field names and contents with <div> elements for AcroForms.  If you're processing XFA, I'm pretty sure we try to associate form keys and values.



If the form is not well put together or if it is a scan of a form, there's not much we can do.





On Mon, Aug 30, 2021 at 2:40 PM Peter Kronenberg <pe...@torch.ai>> wrote:

Is this capability of associating form fields with their data something that PDF Box doesn’t even support?  Just want to understand if it’s just the capability of Tika or if PDFBox doesn’t even have a way to do it




From: Tilman Hausherr <TH...@t-online.de>
Sent: Monday, August 30, 2021 2:34 PM
To: user@tika.apache.org
Subject: Re: Form fields and other issues with PDF files

On 30.08.2021 at 18:47, Peter Kronenberg wrote:

> Is there a way to extract just the form field data?  That way, if it was a known form, it might be easier to match up the responses with the fields they belong to.

I looked at PDFParserConfig and didn't find such an option. Even if there were, I doubt it would help with the matching.

Tilman

> I’ll take a look at Tabula for the tables.




From: Tilman Hausherr <TH...@t-online.de>
Sent: Saturday, August 28, 2021 5:38 AM
To: user@tika.apache.org
Subject: Re: Form fields and other issues with PDF files



Field texts: there is no formal way to do this in the PDF specification.

Tables: try Tabula, they use heuristics

Strike-out text: one is a font, the other a vector graphic (or an annotation). So it's not connected. One would have to write an algorithm.

Tilman
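
As a rough illustration of the kind of algorithm the strike-out case would need, here is a minimal sketch in plain Java. It assumes the text bounding boxes and the drawn line segments have already been collected from the page by some other means (that step is not shown and is not something Tika provides out of the box, per the thread). The class and method names and the thresholds (10% slope tolerance, middle 50% band, 50% horizontal overlap) are all hypothetical choices, not anything from PDFBox or Tika.

```java
public class StrikeoutHeuristic {

    /** Axis-aligned box around a run of text (x, y is one corner). */
    public static final class Box {
        final double x, y, width, height;
        public Box(double x, double y, double width, double height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }
    }

    /** A straight line segment drawn on the page. */
    public static final class Segment {
        final double x1, y1, x2, y2;
        public Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /** Guesses whether the segment strikes through the text box. */
    public static boolean looksStruckOut(Box text, Segment line) {
        // The line must be nearly horizontal relative to the text height.
        if (Math.abs(line.y1 - line.y2) > 0.1 * text.height) {
            return false;
        }
        // It must run through the middle band of the box, which excludes
        // underlines and overlines near the edges.
        double lineY = (line.y1 + line.y2) / 2;
        double bandLow = text.y + 0.25 * text.height;
        double bandHigh = text.y + 0.75 * text.height;
        if (lineY < bandLow || lineY > bandHigh) {
            return false;
        }
        // It must cover at least half of the text horizontally.
        double left = Math.max(text.x, Math.min(line.x1, line.x2));
        double right = Math.min(text.x + text.width, Math.max(line.x1, line.x2));
        return (right - left) >= 0.5 * text.width;
    }

    public static void main(String[] args) {
        Box word = new Box(100, 200, 60, 12);
        System.out.println(looksStruckOut(word, new Segment(98, 206, 162, 206)));
        System.out.println(looksStruckOut(word, new Segment(98, 199, 162, 199)));
    }
}
```

Strike-outs made with an annotation rather than a drawn line would need separate handling, and the thresholds would almost certainly need tuning against real documents.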



On 27.08.2021 at 16:39, Peter Kronenberg wrote:

  1.  When extracting text from PDF files (no OCR), there doesn’t seem to be any way to link the text that was filled in with the name of the form field.   For example, if there is a field marked ‘First Name’ and the user fills that in, the label and the value likely appear on different lines and in different places, with no way to associate them.  Is there any way to do this?

  2.  It’s also sometimes difficult to figure out how tables are extracted.  If I have a 2 column table, it seems to ignore the tabular format and just extract text line by line.  In this example (ignoring the hand-written text), it gets extracted as ‘Comprehensive General Liability (including, if $2.0 million’

[image: image003.png]

  3.  Deleted, or strike-out text, is extracted with no indication.

[image: image004.png]




Re: Form fields and other issues with PDF files

Posted by Tim Allison <ta...@apache.org>.
I'd trust what Tilman has to say.

It is hard to associate AcroForm data with locations on pages because it is
a separate entity in the PDF.

From the one test file we have in our unit tests, I see the following.
This is not XFA.  This suggests that form fields, no matter how they are
spread across the pages, _should_ at least get reasonable markup for actual
AcroForms.  Obviously, if someone has just laid out a form visually without
an AcroForm, YMMV...

<div class="acroform"><ol> <li>aTextField: TIKA-1226</li>
<li>aCheckBox: Yes</li>
<li>aComboBox: [comboExportB]</li>
<li>aListBox: [exportListItemC]</li>
</ol>
</div>

for this PDF
[image: Screen Shot 2021-09-02 at 4.29.41 PM.png]
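
If it helps, list items like the ones above can be turned into name/value pairs with the JDK's built-in XML parser. This is only a sketch under some assumptions: the input is a well-formed fragment shaped like the sample above, each item is "name: value", and the name itself contains no colon. The class and method names are made up for illustration; nothing here is Tika API.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AcroformListParser {

    /** Collects "name: value" pairs from li items inside div class="acroform". */
    public static Map<String, String> parseFields(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        Map<String, String> fields = new LinkedHashMap<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!"acroform".equals(div.getAttribute("class"))) {
                continue; // ignore page content and other divs
            }
            NodeList items = div.getElementsByTagName("li");
            for (int j = 0; j < items.getLength(); j++) {
                String text = items.item(j).getTextContent().trim();
                int sep = text.indexOf(':');
                if (sep < 0) {
                    continue; // not in "name: value" form
                }
                // Split on the first colon only, so values may contain colons.
                fields.put(text.substring(0, sep).trim(),
                           text.substring(sep + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<div class=\"acroform\"><ol> <li>aTextField: TIKA-1226</li>"
                + "<li>aCheckBox: Yes</li>"
                + "<li>aComboBox: [comboExportB]</li>"
                + "<li>aListBox: [exportListItemC]</li></ol></div>";
        parseFields(sample).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

A full XHTML document from Tika would carry more around this fragment (head, metadata, page divs), so the filtering on the class attribute is what keeps the rest of the text out of the map.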

On Thu, Sep 2, 2021 at 11:33 AM Peter Kronenberg <pe...@torch.ai>
wrote:

> So XFA is deprecated, yet that's the only thing that makes this work?
> Does that mean this is less likely to work with newer formats?
> @tallison@apache.org is there another way Tika can do this for non-XFA
> forms?

Re: Form fields and other issues with PDF files

Posted by Peter Kronenberg <pe...@torch.ai>.
So XFA is deprecated, yet that's the only thing that makes this work?  Does that mean this is less likely to work with newer formats?
@tallison@apache.org is there another way Tika can do this for non-XFA forms?


________________________________
From: Tilman Hausherr <TH...@t-online.de>
Sent: Wednesday, September 1, 2021 2:37:00 PM
To: user@tika.apache.org
Subject: Re: Form fields and other issues with PDF files


On 30.08.2021 at 21:26, Peter Kronenberg wrote:

> Hmm, you’re right.  I tried it on another form (the downloadable 1040 from irs.gov) and it does list the values of the form fields (of course, you’d have to do the mapping yourself, so you know that field f1_02 is first name).
>
> But it didn’t work on the sample file I had, which, unfortunately, I can’t share.
>
> It’s definitely not a scanned file.  What are the requirements for this to happen?  Is there a way to convert a PDF to XFA?

Not in PDFBox nor in Tika. XFA is a deprecated format that isn't really part of PDF.

Tilman
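
On the point about doing the mapping yourself: for a known form, one option is a small lookup table from the opaque field names to readable labels, applied to list items such as <li fieldName="f1_02">f1_02: John</li> from the XFA output reported in this thread. The sketch below uses only the JDK. The only label taken from the thread is f1_02 = first name; the other labels, the class name, and the method name are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XfaFieldLabeler {

    /** Maps filled-in fields to readable labels; unknown names are kept as-is. */
    public static Map<String, String> labelFields(String xhtml, Map<String, String> labels)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("li");
        for (int i = 0; i < items.getLength(); i++) {
            Element li = (Element) items.item(i);
            String name = li.getAttribute("fieldName");
            if (name.isEmpty()) {
                continue; // not a form field item
            }
            // The item text repeats the name ("f1_02: John"); strip that prefix.
            String text = li.getTextContent().trim();
            String prefix = name + ":";
            String value = text.startsWith(prefix)
                    ? text.substring(prefix.length()).trim()
                    : text;
            if (value.isEmpty()) {
                continue; // field was left blank
            }
            result.put(labels.getOrDefault(name, name), value);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> labels = new HashMap<>();
        labels.put("f1_02", "First name");      // from the thread
        labels.put("f1_03", "Last name");       // assumed for illustration
        labels.put("f1_08", "Street address");  // assumed for illustration
        String sample = "<div class=\"xfa_form\"><ol>"
                + "<li fieldName=\"f1_01\">f1_01: </li>"
                + "<li fieldName=\"f1_02\">f1_02: John</li>"
                + "<li fieldName=\"f1_03\">f1_03: Smith</li>"
                + "<li fieldName=\"f1_08\">f1_08: 123 Main St</li>"
                + "</ol></div>";
        labelFields(sample, labels).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

The lookup table has to be built by hand once per form revision, since the field names carry no meaning on their own.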






