Posted to users@pdfbox.apache.org by Nicolas Paris <ni...@riseup.net> on 2018/11/28 22:05:00 UTC

extracting checkboxes in non acroform pdf

Hi

I have several PDFs created with PDFCreator 2.0.1.0 and I want to extract
their content as text, including the values of the checkboxes in them.

The PDFs look like regular form PDFs with checkboxes. However, they are not
AcroForm-based, and the regular PDFBox code I use in this case
does not apply: the AcroForm is null!

I wonder how I can iterate over those checkboxes (or visually equivalent
objects or symbols).

If someone can give me a starting point for listing all objects in that PDF,
that would be helpful to begin with.

Thanks in advance,

-- 
nicolas

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: extracting checkboxes in non acroform pdf

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

> Am 29.11.2018 um 20:56 schrieb Tilman Hausherr <TH...@t-online.de>:
> 
> Am 29.11.2018 um 09:49 schrieb Nicolas Paris:
>> Hi
>> 
>>> It could be an XFA forms pdf... then you'd have to analyze the XML content.
>> I opened the pdf in a text editor, and I can say the boxes are in a
>> stream xml entity, in binary format. (By removing some binary, I have
>> been able to remove the boxes.
>> Does it exclude the XFA form pdf nature ?
> 
> 
> Sorry, "nature" looks like a bad translation, and sadly I don't know what you meant...  please write that part in french, which I understand too.
> 
> PDFBox doesn't have an API for the XFA form.

That's not completely correct. If there is an XFA form, AcroForm.getXFA().getDocument() will return the XFA as an XML Document object, and AcroForm.getXFA().getBytes() will return the (XML) content. From there you are on your own and need to process the XML.
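
As a rough illustration of "processing the XML yourself", here is a minimal sketch that uses only the JDK's XML parser. It assumes the bytes passed in are XFA XML such as getXFA().getBytes() would return, and that checkboxes appear as <field> elements whose <ui> child contains a <checkButton>, which is how XFA templates usually describe them. The class name, method name and sample XML in the test are made up for illustration; real XFA packets are larger and may need more careful handling.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XfaCheckButtons {

    /** Returns the "name" attribute of every <field> that contains a <checkButton>. */
    static List<String> checkBoxFieldNames(byte[] xfaXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // XFA elements live in versioned namespaces
        Document doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(xfaXml));

        // "*" matches any namespace, so the XFA version does not matter here.
        NodeList buttons = doc.getElementsByTagNameNS("*", "checkButton");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < buttons.getLength(); i++) {
            // Walk up from <checkButton> to the enclosing <field>.
            Node n = buttons.item(i).getParentNode();
            while (n != null && !"field".equals(n.getLocalName())) {
                n = n.getParentNode();
            }
            if (n instanceof Element) {
                names.add(((Element) n).getAttribute("name"));
            }
        }
        return names;
    }
}
```

This only finds the checkbox definitions; the current values would have to be looked up separately in the XFA data.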

BR
Maruan 


Re: extracting checkboxes in non acroform pdf

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 29.11.2018 um 21:27 schrieb Nicolas Paris:
> On Thu, Nov 29, 2018 at 08:56:59PM +0100, Tilman Hausherr wrote:
>> Am 29.11.2018 um 09:49 schrieb Nicolas Paris:
>>> Hi
>>>
>>>> It could be an XFA forms pdf... then you'd have to analyze the XML content.
>>> I opened the pdf in a text editor, and I can say the boxes are in a
>>> stream xml entity, in binary format. (By removing some binary, I have
>>> been able to remove the boxes.
>>> Does it exclude the XFA form pdf nature ?
>>
>> Sorry, "nature" looks like a bad translation, and sadly I don't know what
>> you meant...  please write that part in french, which I understand too.
> I meant, "do the above informations prove it is *not* a XFA form ?". I
> mean, the boxes arent in xml but in the binary part.


Open the file with PDFDebugger, switch to "show internal structure" in 
the menu, and if you find "Root/AcroForm/XFA" then it has XFA. (Some 
PDFs also have normal fields, e.g.

https://www.pdfscripting.com/public/FreeStuff/PDFSamples/DynamicEmail_XFAForm_V2.pdf 
)

I also don't know what you mean by "binary part". The compressed PDF 
content stream? You can see that one by going to 
Root/Pages/Kids/[0]/Contents .


>
>
>> PDFBox doesn't have an API for the XFA form.
>>
>> You can also upload the PDF to a sharehoster (no mail attachments). Or look
>> at the PDF in PDFDebugger.
> I cannot share any copy of the pdf. Thanks for that proposition that
> would help a lot.
>
>>>> It could be ordinary text, then the text stripper would do the job.
>>> The regular textstripper does not extract them. Does it exclude the text
>>> nature ?
>>
>> Same problem with "nature". PDFBox cannot extract XFA forms. It can detect
>> glyphs that are used for forms, e.g. squares.
> I meant, "if the built-in pdfbox text stripper does not extract the
> check-boxes, does it prove that they are not ordinary text."


I can't tell without seeing the PDF.


>
>
>
> How could I determine the kind of checkbox I have ? Is there a way to
> list all the objects within the pdf ?


PDF isn't like XML, where there is a tree. PDF has an object structure, 
and there are also content streams, which contain graphics and text... 
the text is often not easy to "see" because of encoding.

PDFDebugger really can show you everything in a PDF; the problem is 
understanding it... especially the content stream.

In theory, I could write an algorithm to answer the question. But it 
would probably take several hours, because there are so many ways to do 
the same thing in PDF. And I might still have forgotten something.

In short: what I would do is first look at the AcroForm (field? XFA?), 
then look at the page annotations (location?), then at the content 
stream (see PDF specification, "operator summary"). In the content 
stream I would try to find the boxes by their location, which I'd know 
from having looked at the rendered page and moved the mouse over them.
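
The "page annotations" step could be sketched roughly as follows. This is only an illustration, not a complete checkbox detector: it assumes PDFBox is on the classpath, the class and method names are made up, and it simply reports each widget annotation's rectangle and its /AS (appearance state) entry, which for a checkbox widget is typically "Off" or the name of the "on" state.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;

public class WidgetLister {

    /** Lists widget annotations page by page, whether or not an AcroForm exists. */
    static List<String> listWidgets(PDDocument doc) throws IOException {
        List<String> found = new ArrayList<>();
        int pageNo = 0;
        for (PDPage page : doc.getPages()) {
            pageNo++;
            for (PDAnnotation annotation : page.getAnnotations()) {
                if (!(annotation instanceof PDAnnotationWidget)) {
                    continue; // links, comments, etc. are not form widgets
                }
                PDRectangle rect = annotation.getRectangle();
                // /AS is the appearance state; it is null if the entry is absent.
                String state = annotation.getCOSObject().getNameAsString(COSName.AS);
                found.add("page " + pageNo + " rect=" + rect + " AS=" + state);
            }
        }
        return found;
    }
}
```

If this returns an empty list for every page, the boxes are not annotations, which leaves text or vector graphics in the content stream.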



Tilman




Re: extracting checkboxes in non acroform pdf

Posted by Dick Martin <rt...@nycap.rr.com>.

Yes, there is "a way to list all the objects within the pdf".
That's what Tilman meant when he said "Or look at the PDF in PDFDebugger."
The PDFDebugger is a utility included in the PDFBox download (or maybe
separately downloadable?)


Re: extracting checkboxes in non acroform pdf

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,

> Am 29.11.2018 um 21:27 schrieb Nicolas Paris <ni...@riseup.net>:
> 
> On Thu, Nov 29, 2018 at 08:56:59PM +0100, Tilman Hausherr wrote:
>> Am 29.11.2018 um 09:49 schrieb Nicolas Paris:
>>> Hi
>>> 
>>>> It could be an XFA forms pdf... then you'd have to analyze the XML content.
>>> I opened the pdf in a text editor, and I can say the boxes are in a
>>> stream xml entity, in binary format. (By removing some binary, I have
>>> been able to remove the boxes.
>>> Does it exclude the XFA form pdf nature ?
>> 
>> 
>> Sorry, "nature" looks like a bad translation, and sadly I don't know what
>> you meant...  please write that part in french, which I understand too.
> 
> I meant, "do the above informations prove it is *not* a XFA form ?". I
> mean, the boxes arent in xml but in the binary part.

If PDDocument.getDocumentCatalog().getAcroForm() is null, the document doesn't contain an AcroForm definition, and it is also not an XFA form, as XFA is defined within an AcroForm.

If AcroForm is not null, AcroForm.hasXFA() will tell if there is an XFA-based form, and AcroForm.xfaIsDynamic() will tell if there is only an XFA form definition, which means there are no "regular" form fields in the AcroForm.
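
Put together, the checks above might look like the following minimal sketch. It assumes PDFBox is on the classpath and that the caller has already loaded the document; the class name, method name and returned labels are made up for illustration.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;

public class FormKind {

    /** Classifies a document using getAcroForm(), hasXFA() and xfaIsDynamic(). */
    static String classify(PDDocument doc) {
        PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            // XFA is stored inside the AcroForm, so no AcroForm means no XFA.
            return "no AcroForm, no XFA";
        }
        if (!acroForm.hasXFA()) {
            return "AcroForm fields only";
        }
        // xfaIsDynamic() is true when there is XFA but no regular fields.
        return acroForm.xfaIsDynamic()
                ? "XFA only (dynamic)"
                : "XFA plus regular AcroForm fields";
    }
}
```

For the PDFs described in this thread, the first branch would be taken, so the checkboxes must come from somewhere other than a form definition.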

BR
Maruan



Re: extracting checkboxes in non acroform pdf

Posted by Nicolas Paris <ni...@riseup.net>.

On Thu, Nov 29, 2018 at 08:56:59PM +0100, Tilman Hausherr wrote:
> Am 29.11.2018 um 09:49 schrieb Nicolas Paris:
> > Hi
> > 
> > > It could be an XFA forms pdf... then you'd have to analyze the XML content.
> > I opened the pdf in a text editor, and I can say the boxes are in a
> > stream xml entity, in binary format. (By removing some binary, I have
> > been able to remove the boxes.
> > Does it exclude the XFA form pdf nature ?
> 
> 
> Sorry, "nature" looks like a bad translation, and sadly I don't know what
> you meant...  please write that part in french, which I understand too.

I meant, "does the above information prove it is *not* an XFA form?". I
mean, the boxes aren't in XML but in the binary part.


> 
> PDFBox doesn't have an API for the XFA form.
> 
> You can also upload the PDF to a sharehoster (no mail attachments). Or look
> at the PDF in PDFDebugger.

I cannot share any copy of the PDF. Thanks for that suggestion, which
would help a lot.

> > 
> > > It could be ordinary text, then the text stripper would do the job.
> > The regular textstripper does not extract them. Does it exclude the text
> > nature ?
> 
> 
> Same problem with "nature". PDFBox cannot extract XFA forms. It can detect
> glyphs that are used for forms, e.g. squares.

I meant, "if the built-in PDFBox text stripper does not extract the
checkboxes, does it prove that they are not ordinary text?"



How could I determine the kind of checkbox I have? Is there a way to
list all the objects within the pdf?



-- 
nicolas


Re: extracting checkboxes in non acroform pdf

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 29.11.2018 um 09:49 schrieb Nicolas Paris:
> Hi
>
>> It could be an XFA forms pdf... then you'd have to analyze the XML content.
> I opened the pdf in a text editor, and I can say the boxes are in a
> stream xml entity, in binary format. (By removing some binary, I have
> been able to remove the boxes.
> Does it exclude the XFA form pdf nature ?


Sorry, "nature" looks like a bad translation, and sadly I don't know 
what you meant...  please write that part in French, which I understand too.

PDFBox doesn't have an API for the XFA form.

You can also upload the PDF to a sharehoster (no mail attachments). Or 
look at the PDF in PDFDebugger.


>
>> It could be ordinary text, then the text stripper would do the job.
> The regular textstripper does not extract them. Does it exclude the text
> nature ?


Same problem with "nature". PDFBox cannot extract XFA forms. It can 
detect glyphs that are used for forms, e.g. squares.

Tilman



Re: extracting checkboxes in non acroform pdf

Posted by Nicolas Paris <ni...@riseup.net>.

Hi

> It could be an XFA forms pdf... then you'd have to analyze the XML content.
I opened the PDF in a text editor, and I can see that the boxes are in a
stream XML entity, in binary format. (By removing some of the binary data,
I was able to remove the boxes.)
Does it exclude the XFA form pdf nature?

> It could be ordinary text, then the text stripper would do the job.
The regular text stripper does not extract them. Does it exclude the text
nature?

Thanks a lot


-- 
nicolas


Re: extracting checkboxes in non acroform pdf

Posted by Tilman Hausherr <TH...@t-online.de>.

It could be an XFA forms pdf... then you'd have to analyze the XML content.

It could be widgets annotations without acroform, then you'd have to 
analyse these.

It could be ordinary text, then the text stripper would do the job.

It could be vector graphics, then it gets really difficult.
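
To give an idea of what the vector graphics case involves, here is a deliberately naive sketch in plain Java. It assumes you already have the decoded page content stream as text (for example copied from PDFDebugger) and looks for the rectangle operator "x y w h re" with nearly equal width and height. The class and method names are made up. It ignores the current transformation matrix, string contents and inline images, and boxes drawn as four separate lines or as glyphs would not be found, which is part of why this case is difficult.

```java
import java.util.ArrayList;
import java.util.List;

public class SquareFinder {

    /** Returns {x, y, w, h} for every "re" operation that is roughly square. */
    static List<double[]> findSquares(String contentStream, double tolerance) {
        String[] tokens = contentStream.trim().split("\\s+");
        List<double[]> squares = new ArrayList<>();
        for (int i = 4; i < tokens.length; i++) {
            if (!tokens[i].equals("re")) {
                continue;
            }
            try {
                // The four operands directly precede the operator.
                double x = Double.parseDouble(tokens[i - 4]);
                double y = Double.parseDouble(tokens[i - 3]);
                double w = Double.parseDouble(tokens[i - 2]);
                double h = Double.parseDouble(tokens[i - 1]);
                if (w != 0 && Math.abs(Math.abs(w) - Math.abs(h)) <= tolerance) {
                    squares.add(new double[] {x, y, w, h});
                }
            } catch (NumberFormatException e) {
                // Operands were not plain numbers; skip this occurrence.
            }
        }
        return squares;
    }
}
```

The coordinates found this way are in the stream's own coordinate space, so they only match the rendered page directly when no transformation is in effect.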

Tilman
