Posted to users@pdfbox.apache.org by Paul Grütter <pa...@signotec.de.INVALID> on 2024/03/21 10:07:24 UTC

How to search for / extract text of form field

Hello list,

I want to search for words in a PDF document and get their positions. It seems that PDFBox ignores text which has been entered into a form field, although it is rendered correctly. This can be reproduced easily with the standalone app:

java -jar pdfbox-app-3.0.2.jar export:text -i=Test.pdf
java -jar pdfbox-app-3.0.2.jar render -i=Test.pdf

Acrobat both finds and extracts text that has been entered into a form field.

In my code I use PDFTextStripper. I haven't found any way to configure the behaviour. Is it a bug or have I overlooked something? For clarification: I don't want to search for the value ('V') but its visual representation ('AP').

Kind regards,

Dipl.-Ing. (FH) Paul Grütter
Head of Development

signotec GmbH
Am Gierath 20b
40885 Ratingen (Germany)

Tel.: +49 2102 53575-10
Fax: +49 2102 53575-39

E-Mail: paul.gruetter@signotec.de
URL: www.signotec.com

Amtsgericht Düsseldorf: HRB 44307
Geschäftsführung/CEO: Arne Brandes

Re: How to search for / extract text of form field

Posted by "sahyoun@fileaffairs.de" <sa...@fileaffairs.de>.

On Wednesday, 27.03.2024 at 08:01 +0000, Paul Grütter wrote:
> Hello Gilad,
> 
> Thank you.
> 
> Maruan Sahyoun already contacted me with the same tip. It works fine,
> but only because we currently use PDFBox only for rendering and text
> extraction. If we used it for other use cases, especially for
> filling in form fields, we would have to create a copy of the
> document for text extraction, which is obviously not optimal in a
> web application that may have multiple documents open at the same
> time.

You could fill the form using PDFBox, store it if you have to keep a
copy with the form fields, flatten it afterwards, and then do the
extraction.
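
A minimal sketch of that sequence, assuming PDFBox 3.0.x on the classpath. The field name "name", the value, the file names, and the helper method itself are hypothetical placeholders and not part of PDFBox:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.text.PDFTextStripper;

public class FillFlattenExtract {

    /**
     * Fills one field, optionally saves the still-editable form,
     * then flattens the in-memory document and extracts its text.
     * Pass null as editableCopy to skip saving.
     */
    static String fillFlattenExtract(PDDocument doc, String fieldName,
                                     String value, File editableCopy) throws IOException {
        PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
        if (form != null) {
            PDField field = form.getField(fieldName);
            if (field != null) {
                field.setValue(value);
            }
            if (editableCopy != null) {
                // Saved before flattening, so this file keeps its form fields.
                doc.save(editableCopy);
            }
            // Merges the field appearances into the page content.
            form.flatten();
        }
        return new PDFTextStripper().getText(doc);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file and field names, for illustration only.
        try (PDDocument doc = Loader.loadPDF(new File("Test.pdf"))) {
            String text = fillFlattenExtract(doc, "name", "Jane Doe",
                    new File("Test-filled.pdf"));
            System.out.println(text);
        }
    }
}
```

Only the in-memory document is flattened here. The file written by save() still contains the editable fields.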

Or you do the text extraction and, in addition, iterate over the form
fields and get the content and location of each form field's widget.
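
One possible way to collect field contents and widget rectangles without flattening, assuming PDFBox 3.0.x. The class and method names are illustrative only. Note that getValueAsString() returns the field value ('V'), which may not match the rendered appearance ('AP') exactly:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class FieldLocations {

    /** Returns one line per widget: field name, value, page index and rectangle. */
    static List<String> describeFields(PDDocument doc) {
        List<String> result = new ArrayList<>();
        PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
        if (form == null) {
            return result; // document has no form
        }
        for (PDField field : form.getFieldTree()) {
            // Non-terminal fields have no widgets, so the inner loop is skipped for them.
            for (PDAnnotationWidget widget : field.getWidgets()) {
                PDRectangle r = widget.getRectangle();
                if (r == null) {
                    continue;
                }
                // The page reference of a widget is optional in the PDF; -1 means unknown.
                PDPage page = widget.getPage();
                int pageIndex = page == null ? -1 : doc.getPages().indexOf(page);
                result.add(String.format(Locale.ROOT,
                        "%s = '%s' on page index %d at [x=%.1f, y=%.1f, w=%.1f, h=%.1f]",
                        field.getFullyQualifiedName(), field.getValueAsString(), pageIndex,
                        r.getLowerLeftX(), r.getLowerLeftY(), r.getWidth(), r.getHeight()));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name, for illustration only.
        try (PDDocument doc = Loader.loadPDF(new File("Test.pdf"))) {
            describeFields(doc).forEach(System.out::println);
        }
    }
}
```

The rectangle is in PDF user space, with the origin at the lower left of the page, whereas PDFTextStripper reports positions in its own adjusted coordinates, so the two would need to be reconciled before merging the results.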

BR
Maruan

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: How to search for / extract text of form field

Posted by Paul Grütter <pa...@signotec.de.INVALID>.

Hello Gilad,

Thank you.

Maruan Sahyoun already contacted me with the same tip. It works fine, but only because we currently use PDFBox only for rendering and text extraction. If we used it for other use cases, especially for filling in form fields, we would have to create a copy of the document for text extraction, which is obviously not optimal in a web application that may have multiple documents open at the same time.

Kind regards,

Dipl.-Ing. (FH) Paul Grütter
Head of Development

From: Gilad Denneboom <gi...@gmail.com>
Sent: Sunday, 24 March 2024 22:50
To: paul.gruetter@signotec.de.invalid
Cc: users@pdfbox.apache.org
Subject: Re: How to search for / extract text of form field


Flatten the form fields before searching the file if you want PDFTextStripper to find the text in them.


Re: How to search for / extract text of form field

Posted by Gilad Denneboom <gi...@gmail.com>.

Flatten the form fields before searching the file if you want
PDFTextStripper to find the text in them.
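
A rough sketch of flattening followed by a position-aware search, assuming PDFBox 3.0.x. The Finder subclass, the find helper, the search term, and the file name are hypothetical. Mapping a character index in the chunk to a TextPosition is an approximation and can be off when a glyph maps to more than one character:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class FlattenAndFind {

    /** Records where each text chunk containing the search term starts to match. */
    static class Finder extends PDFTextStripper {
        private final String term;
        final List<String> hits = new ArrayList<>();

        Finder(String term) throws IOException {
            this.term = term;
            setSortByPosition(true);
        }

        @Override
        protected void writeString(String text, List<TextPosition> positions)
                throws IOException {
            int idx = text.indexOf(term);
            // Guard: the chunk text and the glyph list can differ in length.
            if (idx >= 0 && idx < positions.size()) {
                TextPosition p = positions.get(idx);
                hits.add(String.format(Locale.ROOT, "page %d at (%.1f, %.1f)",
                        getCurrentPageNo(), p.getXDirAdj(), p.getYDirAdj()));
            }
            super.writeString(text, positions);
        }
    }

    /** Flattens the form (if any) in memory, then searches the page text. */
    static List<String> find(PDDocument doc, String term) throws IOException {
        PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
        if (form != null) {
            form.flatten();
        }
        Finder finder = new Finder(term);
        finder.getText(doc); // extracted text is discarded; only the hits are kept
        return finder.hits;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name and search term, for illustration only.
        try (PDDocument doc = Loader.loadPDF(new File("Test.pdf"))) {
            find(doc, "Hello").forEach(System.out::println);
        }
    }
}
```

Since the document is never saved, the flattening affects only the in-memory copy used for the search.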
