Posted to users@pdfbox.apache.org by Paul Grütter <pa...@signotec.de.INVALID> on 2024/03/21 10:07:24 UTC
How to search for / extract text of form field
Hello list,
I want to search for words in a PDF document and get their positions. It seems that PDFBox ignores text that has been entered into a form field, although it is rendered correctly. This can be reproduced easily with the standalone app:
java -jar pdfbox-app-3.0.2.jar export:text -i=Test.pdf
java -jar pdfbox-app-3.0.2.jar render -i=Test.pdf
Acrobat both finds and extracts text that has been entered into a form field.
In my code I use PDFTextStripper. I haven't found any way to configure the behaviour. Is it a bug or have I overlooked something? For clarification: I don't want to search for the value ('V') but its visual representation ('AP').
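For context, here is a minimal sketch of the kind of word search described above, assuming PDFBox 3.x. The class WordLocator and its Hit type are made-up names for illustration and are not part of PDFBox. The sketch overrides PDFTextStripper.writeString to record where a search term starts. It assumes one TextPosition per character, which may not hold for ligatures, and it does not find a term that is split across several runs of text. It only sees text in page content streams, so text that exists only in a form field appearance is not found:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical helper for illustration, not a PDFBox class.
class WordLocator extends PDFTextStripper {

    /** One match: 1-based page number and the position of its first glyph. */
    static final class Hit {
        final int page;
        final float x; // as reported by TextPosition.getXDirAdj()
        final float y; // as reported by TextPosition.getYDirAdj()

        Hit(int page, float x, float y) {
            this.page = page;
            this.x = x;
            this.y = y;
        }
    }

    private final String needle;
    private final List<Hit> hits = new ArrayList<>();

    WordLocator(String needle) throws IOException {
        if (needle == null || needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        this.needle = needle;
    }

    /** Runs the text extraction and returns every place the needle starts. */
    List<Hit> locate(PDDocument doc) throws IOException {
        hits.clear();
        getText(doc); // drives writeString() for every run of text
        return new ArrayList<>(hits);
    }

    @Override
    protected void writeString(String text, List<TextPosition> positions) throws IOException {
        int idx = text.indexOf(needle);
        while (idx >= 0) {
            // Assumption: character index == TextPosition index within this run.
            if (idx < positions.size()) {
                TextPosition first = positions.get(idx);
                hits.add(new Hit(getCurrentPageNo(), first.getXDirAdj(), first.getYDirAdj()));
            }
            idx = text.indexOf(needle, idx + needle.length());
        }
        super.writeString(text, positions);
    }
}
```

Usage would be along the lines of `new WordLocator("Invoice").locate(document)`, where the search term is of course only an example.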
Kind regards,
Dipl.-Ing. (FH) Paul Grütter
Head of Development
signotec GmbH
Am Gierath 20b
40885 Ratingen (Germany)
Tel.: +49 2102 53575-10
Fax: +49 2102 53575-39
E-Mail: paul.gruetter@signotec.de
URL: www.signotec.com
Amtsgericht Düsseldorf: HRB 44307
Geschäftsführung/CEO: Arne Brandes
Re: How to search for / extract text of form field
Posted by "sahyoun@fileaffairs.de" <sa...@fileaffairs.de>.
On Wednesday, 27.03.2024 at 08:01 +0000, Paul Grütter wrote:
> Hello Gilad,
>
> Thank you.
>
> Maruan Sahyoun already contacted me with the same tip. It works fine,
> but only because we currently use PDFBox just for rendering and text
> extraction. If we used it for other use cases, especially for
> filling in form fields, we would have to create a copy of the
> document for text extraction, which is obviously not optimal in a
> web application that may have multiple documents open at the same
> time.
You could fill the form using PDFBox, store it if you have to keep a
copy with the form fields, flatten it afterwards, and then do the
extraction.
Or you do the text extraction and, in addition, iterate over the form
fields and get the content and location of each form field's widget.
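The second option could look roughly like the sketch below, assuming PDFBox 3.x. FieldTextLocator and FieldHit are hypothetical names chosen for illustration. Note that it matches against the field value (getValueAsString()), not the rendered appearance, and that it reports the widget rectangle rather than glyph positions, so it only approximates a search on the visual representation:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

// Hypothetical helper for illustration, not a PDFBox class.
class FieldTextLocator {

    /** One widget of a field whose value contains the search term. */
    static final class FieldHit {
        final String fieldName;
        final String value;
        final int pageIndex;   // 0-based, or -1 if the widget has no /P entry
        final PDRectangle box; // widget rectangle in page coordinates

        FieldHit(String fieldName, String value, int pageIndex, PDRectangle box) {
            this.fieldName = fieldName;
            this.value = value;
            this.pageIndex = pageIndex;
            this.box = box;
        }
    }

    /** Returns the widgets of all fields whose value contains the needle. */
    static List<FieldHit> find(PDDocument doc, String needle) {
        List<FieldHit> hits = new ArrayList<>();
        PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
        if (form == null) {
            return hits; // document has no interactive form
        }
        for (PDField field : form.getFieldTree()) {
            String value = field.getValueAsString();
            if (value == null || !value.contains(needle)) {
                continue;
            }
            // A field can be shown in several places, one widget per place.
            for (PDAnnotationWidget widget : field.getWidgets()) {
                PDPage page = widget.getPage();
                int pageIndex = page == null ? -1 : doc.getPages().indexOf(page);
                hits.add(new FieldHit(field.getFullyQualifiedName(), value,
                        pageIndex, widget.getRectangle()));
            }
        }
        return hits;
    }
}
```

The results can then be merged with the hits from the regular text extraction. Because nothing is flattened, the document stays fillable and no copy is needed.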
BR
Maruan
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
Re: How to search for / extract text of form field
Posted by Paul Grütter <pa...@signotec.de.INVALID>.
Hello Gilad,
Thank you.
Maruan Sahyoun already contacted me with the same tip. It works fine, but only because we currently use PDFBox just for rendering and text extraction. If we used it for other use cases, especially for filling in form fields, we would have to create a copy of the document for text extraction, which is obviously not optimal in a web application that may have multiple documents open at the same time.
Kind regards,
Dipl.-Ing. (FH) Paul Grütter
Head of Development
Re: How to search for / extract text of form field
Posted by Gilad Denneboom <gi...@gmail.com>.
Flatten the form fields before searching the file if you want
PDFTextStripper to find the text in them.
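A minimal sketch of that tip, assuming PDFBox 3.x (Loader.loadPDF(byte[]) and PDAcroForm.flatten()). The class and method names are made up for illustration. Flattening modifies the document, so the sketch works on a second instance loaded from a byte array and leaves the caller's own document untouched, at the cost of an extra in-memory copy:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper for illustration, not a PDFBox class.
class FlattenThenExtract {

    /** Returns the page text, including text drawn by form field appearances. */
    static String extractText(byte[] pdfBytes) throws IOException {
        // Separate instance: flatten() removes the fields from this copy only.
        try (PDDocument copy = Loader.loadPDF(pdfBytes)) {
            PDAcroForm form = copy.getDocumentCatalog().getAcroForm();
            if (form != null) {
                // Moves each visible field appearance into the page content.
                form.flatten();
            }
            return new PDFTextStripper().getText(copy);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: FlattenThenExtract <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(extractText(bytes));
    }
}
```

The same flatten-first approach should also work with a PDFTextStripper subclass that collects TextPosition objects, since the flattened appearances are then part of the page content.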