Posted to users@pdfbox.apache.org by Balaji Venkatamohan <bv...@tibco.com> on 2015/05/19 21:35:31 UTC

How to flatedecode and find all acroform fields in a compressed PDF

Hello,

I am using PDFBox 1.8.9 in a product that reads from and writes to the
acroform fields of a PDF. Right now, I have two versions of a PDF given to
us by a potential customer: one is compressed and the other is not. The PDF
is three pages long and contains a number of acroform fields. Sorry, I
cannot share the PDF because of restrictions.

The size of the uncompressed file is 1.67 MB and the size of the compressed
version is 27 KB.
I can open both PDFs in Reader XI, Foxit Reader and many other readers, and
I am able to successfully modify the values of the acroform fields and save
them.

When I open the uncompressed PDF using PDDocument.load(File f), I am able
to successfully read from and write into the many acroform fields using the
API calls below:

PDDocumentCatalog docCatalog = document.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();

However, I am unable to see any acroForm fields when I open the compressed
PDF.
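
A quick first check, along the lines suggested later in this thread, is to
look for the /AcroForm key in the raw bytes of each file. The sketch below
does that with the Java standard library only; the class and method names
are illustrative and not part of PDFBox. A miss is not conclusive, because
the catalog may sit inside a compressed object stream, in which case the
key is not visible as plain text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class AcroFormTokenCheck {

    // Returns true if the literal name "/AcroForm" occurs in the raw bytes.
    static boolean containsAcroFormKey(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary
        // stream data cannot break the decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/AcroForm");
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF to inspect (hypothetical input).
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(containsAcroFormKey(bytes)
                ? "/AcroForm key found"
                : "/AcroForm key not found as plain text");
    }
}
```

Running it once on the uncompressed file and once on the compressed file
gives the same answer as searching both files in a text editor.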

I opened the compressed PDF in Notepad and could see that the objects
have been compressed with the FlateDecode filter. The lines below are from
the first page of the PDF:

2 0 obj
<</Filter/FlateDecode/Length 5675>>stream
....
....
....
endstream
endobj

I see a FlateFilter.java class with encode and decode methods, which are
in turn used by public methods in COSStream.java. I am unable to connect the
dots and flatedecode the PDF.
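
For orientation: a /FlateDecode stream holds zlib-compressed data, so the
bytes between "stream" and "endstream" can be inflated with java.util.zip
alone. The following is a minimal sketch under that assumption. It is not
PDFBox's FlateFilter; the class and method names are made up for
illustration, and predictor parameters (/DecodeParms) are not handled. As
the replies in this thread point out, decoding a page content stream yields
drawing operators; form fields are reached through the /AcroForm entry of
the document catalog.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.InflaterInputStream;

class FlateDecodeSketch {

    // Inflates the raw body of a /FlateDecode stream (zlib format).
    static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            // Copy until the inflater reports the end of the compressed data.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }
}
```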

My question is: how do I flatedecode a PDF so that I can find all the
acroform fields within it? Any help or pointers would be highly appreciated.

Thanks,
Balaji

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.

Alright!

On Wed, May 27, 2015 at 11:20 AM, Tilman Hausherr <TH...@t-online.de>
wrote:

> On 27.05.2015 at 20:02, Balaji Venkatamohan wrote:
>
>> Thanks Tilman for letting the website developer know about the
>> shortcomings
>> of their compression technique.
>>
>> The PDF owner did not share with us the information about which website
>> they used for compressing the PDF. My teammates helped in identifying this
>> website. I will let the customer know about this particular website and
>> will leave it to them regarding continuing to use this website for their
>> PDF documents.
>>
>> Could you also answer the following question please?
>> Would Pdfbox API change its code to accommodate the incorrect condition
>> that annotation fields (editable fields) are outside acro form fields as
>> well? I know the PDF compressed by the website is incorrect and hence I
>> would understand if you don't go ahead with this.
>>
>
> As I said, I'm not the acroform specialist here. So I can't tell if it is
> possible to repair these, and if there aren't side effects that e.g. PDFs
> with annotations end up being forms. Yes, we've done all sorts of things to
> accommodate broken PDFs. But here the fault is known, it is a website that
> deletes data from PDFs to "compress" them. The better solution would be to
> have this guy fix his website, i.e. allow options to decide what is to be
> removed, and what not. Another solution (which I mentioned before) would be
> to have your customer compress his PDFs with the method I mentioned in this
> thread, i.e. if this customer of yours generates PDFs, but doesn't have the
> knowledge to compress the streams. He could of course look into our source
> code (FlateFilter.java, it is just 10 lines)  and see how to compress
> himself.
>
> Tilman
>
>
>
>> Thanks,
>> Balaji
>>
>>
>> On Tue, May 26, 2015 at 10:45 PM, Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>  I just tested it. It also removes /Outlines and /Metadata and more
>>> important data from PDF files.
>>>
>>> So your client can't share the PDF with us, but he shared it with some
>>> website.
>>>
>>> A little research shows that this website is owned by Lauri Lehtinen from
>>> Tallinn, Estonia.
>>> http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=pdfcompress.com
>>> https://www.linkedin.com/in/laurilehtinen
>>> https://twitter.com/laurii
>>>
>>> I also tweeted him.
>>>
>>> Tilman
>>>
>>>
>>> On 27.05.2015 at 03:06, Balaji Venkatamohan wrote:
>>>
>>>  Okay, I found out the online tool used by the customer to compress their
>>>> PDF.
>>>>
>>>> It is : https://www.pdfcompress.com/
>>>>
>>>> I don't need to rely on the PDF sent by the customer because all PDFs
>>>> that
>>>> are available on the web, are compressed in the same manner by this
>>>> tool,
>>>> that is, it gets rid of all acro form fields during compression.
>>>>
>>>> For example, the f941 govt form available at this site:
>>>> http://www.irs.gov/pub/irs-pdf/f941.pdf
>>>> If we compress this using the online tool, the resultant file size is
>>>> very
>>>> low, which is good. However, there are no acro form fields in the
>>>> compressed PDF.
>>>>
>>>> Thanks,
>>>> Balaji
>>>>
>>>>
>>>>
>>>> On Sun, May 24, 2015 at 2:38 AM, Maruan Sahyoun <sahyoun@fileaffairs.de
>>>> >
>>>> wrote:
>>>>
>>>>   Hi,
>>>>
>>>>>   Am 23.05.2015 um 16:37 schrieb Balaji Venkatamohan <
>>>>> bvenkata@tibco.com
>>>>>
>>>>>> :
>>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> So AcroForms/Fields is an empty Array?
>>>>>>
>>>>>> Yes, in the filled interview_compressed.pdf, the acroforms are not
>>>>>> null
>>>>>>
>>>>>>  but
>>>>>
>>>>>  empty. Size of array is zero.
>>>>>>
>>>>>> Also, I tried qpdf command line tool to compress the file
>>>>>> interview.pdf
>>>>>>
>>>>>>  and
>>>>>
>>>>>  the resultant compressed file size of 1.6MB was no way near the file
>>>>>> size
>>>>>> of interview_compressed.pdf (21 KB).
>>>>>>
>>>>>>  would you think it's possible to get a similar PDF file or
>>>>> permission to
>>>>> use it internally so we have a sample to look at a potential fix.
>>>>>
>>>>> Although the PDF is not inline with the spec as Acrobat is able to
>>>>> handle
>>>>> it we could look into getting a similar result.
>>>>>
>>>>> BR
>>>>> Maruan
>>>>>
>>>>>
>>>>>   Thanks,
>>>>>
>>>>>> Balaji
>>>>>>
>>>>>> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <
>>>>>> sahyoun@fileaffairs.de
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>>>   Am 22.05.2015 um 23:00 schrieb Balaji Venkatamohan <
>>>>>>>
>>>>>>>> bvenkata@tibco.com
>>>>>>>>
>>>>>>>>  :
>>>>>>> I opened the interview_compressed in notepad++ and did not see any
>>>>>>>
>>>>>>>> 'Acroform' text anywhere.
>>>>>>>> However, as Maruan suggested, I entered some data into what looks
>>>>>>>> like
>>>>>>>>
>>>>>>>>  form
>>>>>>>
>>>>>>>  fields of interview_compressed.pdf and saved it. When I opened this
>>>>>>>>
>>>>>>>>  file
>>>>>>>
>>>>>> in
>>>>>>
>>>>>>> notepad++, I did see 'Acroform' text in it. I also noticed an
>>>>>>>> increase
>>>>>>>>
>>>>>>>>  in
>>>>>>>
>>>>>> file size from 21 KB to ~530 KB.
>>>>>>
>>>>>>> I then ran this filled saved compressed PDF in pdfdebugger.java and
>>>>>>>> saw
>>>>>>>> that the field values were getting stored but not under Acroform
>>>>>>>> fields
>>>>>>>>
>>>>>>>>  but
>>>>>>>
>>>>>>>  under Annotations.
>>>>>>>>
>>>>>>>>
>>>>>>> So AcroForms/Fields is an empty Array?
>>>>>>>
>>>>>>>   Please refer to this image:
>>>>>>>
>>>>>>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>>>>>>>
>>>>>>>> So, whatever the compression technique was, it simply made all the
>>>>>>>>
>>>>>>>>  Acroform
>>>>>>>
>>>>>>>  fields disappear from the original PDF but retained all annotations
>>>>>>>>
>>>>>>>>  which
>>>>>>>
>>>>>> also contain the interactive forms and this helped reduce the file
>>>>>> size
>>>>>>
>>>>>>> so
>>>>>>>
>>>>>>>  much? If this is the case, can pdfbox API also use similar
>>>>>>>> compression
>>>>>>>> technique to compress such a a huge file into a smaller one?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <
>>>>>>>>
>>>>>>>>  sahyoun@fileaffairs.de>
>>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>>   Hi,
>>>>>>>>
>>>>>>>>>   Am 22.05.2015 um 21:54 schrieb Tilman Hausherr <
>>>>>>>>> THausherr@t-online.de
>>>>>>>>>
>>>>>>>> :
>>>>>>
>>>>>>> Am 22.05.2015 um 17:53 schrieb Balaji Venkatamohan:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I used PdfDebugger to make the internal PDF structure of the two
>>>>>>>>>>>
>>>>>>>>>>>  files
>>>>>>>>>>
>>>>>>>>> (1)
>>>>>>
>>>>>>> interview.pdf and (2) interview_compressed.pdf  visually available
>>>>>>>>>> and I
>>>>>>>>>>
>>>>>>>>> have uploaded my images to imageshack. Here are the four links:
>>>>>>>>
>>>>>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>>>>>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>>>>>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>>>>>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>>>>>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>>>>>>>>
>>>>>>>>>>> The first two links are from the internal structure of
>>>>>>>>>>> interview.pdf
>>>>>>>>>>> (original uncompressed file)
>>>>>>>>>>> The third and fourth links are from the internal structure of
>>>>>>>>>>> interview_compressed.pdf (compressed file)
>>>>>>>>>>> The fifth link compares the file sizes of the two files and as
>>>>>>>>>>> you
>>>>>>>>>>>
>>>>>>>>>>>  can
>>>>>>>>>>
>>>>>>>>> also
>>>>>>
>>>>>>> see, the difference is huge.
>>>>>>>>>>
>>>>>>>>>>> As you might notice, the file interview_compressed.pdf has no
>>>>>>>>>>>
>>>>>>>>>>>  acroform
>>>>>>>>>>
>>>>>>>>> Indeed... but this is needed - from the spec:
>>>>>>
>>>>>>> "The contents and properties of a document’s interactive form shall
>>>>>>>>>>
>>>>>>>>>>  be
>>>>>>>>>
>>>>>>>> defined by an interactive form dictionary that shall be referenced
>>>>>>
>>>>>>> from
>>>>>>>>
>>>>>>> the
>>>>>>
>>>>>>> AcroForm entry in the document catalogue (see 7.7.2, “Document
>>>>>>>> Catalog”).
>>>>>>>> Table 218 shows the contents of this dictionary."
>>>>>>>>
>>>>>>>>> correct
>>>>>>>>>
>>>>>>>>>   fields listed even though opening the PDF in pdf reader allows
>>>>>>>>> me to
>>>>>>>>>
>>>>>>>>>> enter
>>>>>>>>>> values in places which look like AcroForm fields and also save
>>>>>>>>>> them.
>>>>>>>>>> Are
>>>>>>>>>>
>>>>>>>>> there any other PDF 'types' similar to Acroform fields which would
>>>>>>>>
>>>>>>>>> enable
>>>>>>>>>> users to fill data and which can be accessed in PdfBox APIs
>>>>>>>>>> without
>>>>>>>>>> having
>>>>>>>>>> to go through PDAcrofield?
>>>>>>>>>> Yes, annotations... there are some common parts, but this is just
>>>>>>>>>> a
>>>>>>>>>>
>>>>>>>>>>  vague observation from me, I'm not the acroform specialist.
>>>>>>>>>
>>>>>>>>> from a first glance it looks like there are all entries necessary
>>>>>>>>> to
>>>>>>>>>
>>>>>>>>>  (re-)
>>>>>>>> generate the form fields. That's what's likely happening for this
>>>>>>>> document
>>>>>>>> in Adobe Reader. Would be interesting to see what's being save after
>>>>>>>> the
>>>>>>>>
>>>>>>> forms has been filled out and saved using Acrobat. We'd need a test
>>>>>>
>>>>>>> form to
>>>>>>>> come up with an enhancement like this.
>>>>>>>>
>>>>>>>>> BR
>>>>>>>>> Maruan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   What you should do: use NOTEPAD++ to look whether there's
>>>>>>>>>
>>>>>>>>>> "/AcroForm"
>>>>>>>>>>
>>>>>>>>>>  in
>>>>>>>>>
>>>>>>>> the "compressed" file.
>>>>>>>>
>>>>>>>>> - if it is missing, tell the client (or your boss) just that
>>>>>>>>>> - if it isn't missing, then there's some problem in PDFBox (try
>>>>>>>>>> also
>>>>>>>>>>
>>>>>>>>>>  the
>>>>>>>>>
>>>>>>>> loadNonSeq I mentioned earlier)
>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>>
>>>>>>>>>>>> You can use qpdf , then use these options:
>>>>>>>>>>>
>>>>>>>>>>> I will now try using this link to compress the original file.
>>>>>>>>>>>
>>>>>>>>>>>> Another strategy to think about - can your client generate a
>>>>>>>>>>>> non-confidential file, so that you can share it, and the "compressed"
>>>>>>>>>>>> file?
>>>>>>>>>>>
>>>>>>>>>>> I wish I had direct communication with the clients but due to
>>>>>>>>>>> bureaucracy, I am having to go through multiple layers to get my message
>>>>>>>>>>> across to them.
>>>>>>>>>>> I will share more information as soon as I have them.
>>>>>>>>>>>
>>>>>>>>>>> PS: i sent these image links to my personal email first to make sure
>>>>>>>>>>> that I can open them. I could and so I am hoping you all could too. If
>>>>>>>>>>> you are unable to open them, please let me know.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Balaji
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <THausherr@t-online.de>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <bv...@tibco.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you for your pointers and sorry about the image. I am attaching
>>>>>>>>>>>>>> it with this email.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The point I am trying to make is that the PDF, which was decompressed
>>>>>>>>>>>>>> using WriteDecodedDoc, is smaller in size than the original PDF given
>>>>>>>>>>>>>> to us by our customers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>>>>>>>>>>>>>> not have any PDAcroform fields whereas the decompressed PDF given to
>>>>>>>>>>>>>> us by the customers does contain Acroform fields. Hence I wanted to
>>>>>>>>>>>>>> know how to properly decompress the PDF using pdfbox APIs. The reason
>>>>>>>>>>>>>> why I was analyzing COSStream was to check if the decompression of the
>>>>>>>>>>>>>> compressed PDF was happening correctly while using PDFBox APIs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I know it would have been difficult for you to help me without the
>>>>>>>>>>>>>> actual PDFs. For that, I would like to thank you for your time and
>>>>>>>>>>>>>> pointers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe it's worth trying to share the file "visually" with us. Open both
>>>>>>>>>>>>> files (compressed and decompressed) with PDFDebugger [1] and post a
>>>>>>>>>>>>> screenshot of both somewhere (dropbox etc.) and share the link with us.
>>>>>>>>>>>>> Maybe that could shed some light on your issue.
>>>>>>>>>>>>
>>>>>>>>>>>> @Balaji: here's an example of what such a screenshot would look like:
>>>>>>>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>>>>>>>>
>>>>>>>>>>>> Tilman
>>>>>>>>>>>>
>>>>>>>>>>>>> BR
>>>>>>>>>>>>> Andreas Lehmkühler
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <THausherr@t-online.de>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is all very confusing... /acroform is in the document catalog. I
>>>>>>>>>>>>>>> don't see how the page content stream is related to it. The best is
>>>>>>>>>>>>>>> that you either go through the source code, or read the spec and then
>>>>>>>>>>>>>>> look at the pdf.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>>>>>>>>>>>> entry and then compare the two files.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is really difficult to help you without the files. The cause could
>>>>>>>>>>>>>>> be a bug in pdfbox, or a malformed pdf...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some more ideas:
>>>>>>>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in
>>>>>>>>>>>>>>>   the acroform stuff. Note that the API is different.
>>>>>>>>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you still need help, one possibility would be 1) post the smallest
>>>>>>>>>>>>>>> possible code that fails, and 2) post a small part of the raw PDF,
>>>>>>>>>>>>>>> i.e. the objects relevant to the field in your code.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Tilman
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3 pages), I
>>>>>>>>>>>>>>>> tried getting the COSStream for each page:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>>>>>>>> PDStream pdStream = firstPage.getContents();
>>>>>>>>>>>>>>>> COSStream stream = pdStream.getStream();
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In the above code snippet, the object stream, when analyzed in debug
>>>>>>>>>>>>>>>> mode, has the following:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From this point on, using the COSStream object for every page, how
>>>>>>>>>>>>>>>> can I decompress and find out the acroform fields given that the
>>>>>>>>>>>>>>>> unFilteredStream object is null for COSStream?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>>>>>>>>>>>>> <bvenkata@tibco.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>      Thank you for your response Tilman.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>      I had previously tried using the WriteDecodedDoc for my compressed
>>>>>>>>>>>>>>>>      PDF and I tried to get the number of acro form fields present in
>>>>>>>>>>>>>>>>      the output file generated by WriteDecodedDoc. The API still could
>>>>>>>>>>>>>>>>      not find the acro form fields in the generated decompressed file.
>>>>>>>>>>>>>>>>      Also the decompressed file generated is 75 KB which is far less
>>>>>>>>>>>>>>>>      than the original decompressed file which I have (1.6 MB) though I
>>>>>>>>>>>>>>>>      could edit the acro form fields using acrobat reader.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>      Thanks,
>>>>>>>>>>>>>>>>      Balaji
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>      On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>>>>>>>>      <THausherr@t-online.de> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>          On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>              My question is: how do I flatedecode a PDF so that I can
>>>>>>>>>>>>>>>>              find all the acroform fields within it. ANy help or
>>>>>>>>>>>>>>>>              pointers would be highly appreciated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>          You could try the WriteDecodedDoc option of the command line app
>>>>>>>>>>>>>>>>          https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>          Maybe you can have further ideas by comparing the two files
>>>>>>>>>>>>>>>>          with NOTEPAD++.... however the two files might have their
>>>>>>>>>>>>>>>>          objects in different order.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>          Tilman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
On 27.05.2015 at 20:02, Balaji Venkatamohan wrote:
> Thanks Tilman for letting the website developer know about the shortcomings
> of their compression technique.
>
> The PDF owner did not share with us the information about which website
> they used for compressing the PDF. My teammates helped in identifying this
> website. I will let the customer know about this particular website and
> will leave it to them regarding continuing to use this website for their
> PDF documents.
>
> Could you also answer the following question please?
> Would Pdfbox API change its code to accommodate the incorrect condition
> that annotation fields (editable fields) are outside acro form fields as
> well? I know the PDF compressed by the website is incorrect and hence I
> would understand if you don't go ahead with this.

As I said, I'm not the acroform specialist here, so I can't tell whether
it is possible to repair these, or whether there would be side effects,
e.g. that PDFs with annotations end up being treated as forms. Yes, we've
done all sorts of things to accommodate broken PDFs. But here the fault is
known: it is a website that deletes data from PDFs to "compress" them. The
better solution would be to have this guy fix his website, i.e. offer
options to decide what is to be removed and what is not. Another solution
(which I mentioned before) would be to have your customer compress his
PDFs with the method I mentioned in this thread, i.e. if this customer of
yours generates PDFs but doesn't have the knowledge to compress the
streams. He could of course look into our source code (FlateFilter.java,
it is just 10 lines) and see how to compress them himself.
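
To illustrate what compressing a stream involves, here is a minimal sketch
that deflates a byte array with java.util.zip. It is not a copy of
FlateFilter.java; the class and method names are illustrative, and writing
the result back into a PDF (setting /Filter and /Length on the stream
dictionary) is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

class FlateEncodeSketch {

    // Compresses plain stream data into zlib format, as used by /FlateDecode.
    static byte[] deflate(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the DeflaterOutputStream finishes the compressed data.
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(plain);
        }
        return out.toByteArray();
    }
}
```

This kind of compression is lossless: it shrinks the stream data and does
not remove any objects from the document.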

Tilman

>
> Thanks,
> Balaji
>
>
> On Tue, May 26, 2015 at 10:45 PM, Tilman Hausherr <TH...@t-online.de>
> wrote:
>
>> I just tested it. It also removes /Outlines and /Metadata and more
>> important data from PDF files.
>>
>> So your client can't share the PDF with us, but he shared it with some
>> website.
>>
>> A little research shows that this website is owned by Lauri Lehtinen from
>> Tallinn, Estonia.
>> http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=pdfcompress.com
>> https://www.linkedin.com/in/laurilehtinen
>> https://twitter.com/laurii
>>
>> I also tweeted him.
>>
>> Tilman
>>
>>
>> On 27.05.2015 at 03:06, Balaji Venkatamohan wrote:
>>
>>> Okay, I found out the online tool used by the customer to compress their
>>> PDF.
>>>
>>> It is : https://www.pdfcompress.com/
>>>
>>> I don't need to rely on the PDF sent by the customer because all PDFs that
>>> are available on the web, are compressed in the same manner by this tool,
>>> that is, it gets rid of all acro form fields during compression.
>>>
>>> For example, the f941 govt form available at this site:
>>> http://www.irs.gov/pub/irs-pdf/f941.pdf
>>> If we compress this using the online tool, the resultant file size is very
>>> low, which is good. However, there are no acro form fields in the
>>> compressed PDF.
>>>
>>> Thanks,
>>> Balaji
>>>
>>>
>>>
>>> On Sun, May 24, 2015 at 2:38 AM, Maruan Sahyoun <sa...@fileaffairs.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> On 23.05.2015 at 16:37, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>> So AcroForms/Fields is an empty Array?
>>>>>
>>>>> Yes, in the filled interview_compressed.pdf, the acroforms are not null
>>>>> but empty. Size of array is zero.
>>>>>
>>>>> Also, I tried qpdf command line tool to compress the file interview.pdf
>>>>> and the resultant compressed file size of 1.6MB was no way near the file
>>>>> size of interview_compressed.pdf (21 KB).
>>>>
>>>> Would you think it's possible to get a similar PDF file or permission to
>>>> use it internally so we have a sample to look at a potential fix?
>>>>
>>>> Although the PDF is not in line with the spec, as Acrobat is able to
>>>> handle it we could look into getting a similar result.
>>>>
>>>> BR
>>>> Maruan
>>>>
>>>>> Thanks,
>>>>> Balaji
>>>>>
>>>>> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 22.05.2015 at 23:00, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
>>>>>>
>>>>>>> I opened the interview_compressed in notepad++ and did not see any
>>>>>>> 'Acroform' text anywhere.
>>>>>>> However, as Maruan suggested, I entered some data into what looks like
>>>>>>> form fields of interview_compressed.pdf and saved it. When I opened
>>>>>>> this file in notepad++, I did see 'Acroform' text in it. I also noticed
>>>>>>> an increase in file size from 21 KB to ~530 KB.
>>>>>>> I then ran this filled saved compressed PDF in pdfdebugger.java and saw
>>>>>>> that the field values were getting stored but not under Acroform fields
>>>>>>> but under Annotations.
>>>>>>
>>>>>> So AcroForms/Fields is an empty Array?
>>>>>>
>>>>>>> Please refer to this image:
>>>>>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>>>>>>
>>>>>>> So, whatever the compression technique was, it simply made all the
>>>>>>> Acroform fields disappear from the original PDF but retained all
>>>>>>> annotations which also contain the interactive forms and this helped
>>>>>>> reduce the file size so much? If this is the case, can pdfbox API also
>>>>>>> use similar compression technique to compress such a huge file into a
>>>>>>> smaller one?
>>>>>>>
>>>>>>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 22.05.2015 at 21:54, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>>>>>
>>>>>>>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I used PdfDebugger to make the internal PDF structure of the two
>>>>>>>>>> files, (1) interview.pdf and (2) interview_compressed.pdf, visually
>>>>>>>>>> available and I have uploaded my images to imageshack. Here are the
>>>>>>>>>> five links:
>>>>>>>>>>
>>>>>>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>>>>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>>>>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>>>>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>>>>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>>>>>>>
>>>>>>>>>> The first two links are from the internal structure of interview.pdf
>>>>>>>>>> (original uncompressed file).
>>>>>>>>>> The third and fourth links are from the internal structure of
>>>>>>>>>> interview_compressed.pdf (compressed file).
>>>>>>>>>> The fifth link compares the file sizes of the two files and, as you
>>>>>>>>>> can also see, the difference is huge.
>>>>>>>>>>
>>>>>>>>>> As you might notice, the file interview_compressed.pdf has no acroform
>>>>> Indeed... but this is needed - from the spec:
>>>>>>>>> "The contents and properties of a document’s interactive form shall
>>>>>>>>>
>>>>>>>> be
>>>>> defined by an interactive form dictionary that shall be referenced
>>>>>>> from
>>>>> the
>>>>>>> AcroForm entry in the document catalogue (see 7.7.2, “Document
>>>>>>> Catalog”).
>>>>>>> Table 218 shows the contents of this dictionary."
>>>>>>>> correct
>>>>>>>>
>>>>>>>>   fields listed even though opening the PDF in pdf reader allows me to
>>>>>>>>> enter
>>>>>>>>> values in places which look like AcroForm fields and also save them.
>>>>>>>>> Are
>>>>>>> there any other PDF 'types' similar to Acroform fields which would
>>>>>>>>> enable
>>>>>>>>> users to fill data and which can be accessed in PdfBox APIs without
>>>>>>>>> having
>>>>>>>>> to go through PDAcrofield?
>>>>>>>>> Yes, annotations... there are some common parts, but this is just a
>>>>>>>>>
>>>>>>>> vague observation from me, I'm not the acroform specialist.
>>>>>>>>
>>>>>>>> from a first glance it looks like there are all entries necessary to
>>>>>>>>
>>>>>>> (re-)
>>>>>>> generate the form fields. That's what's likely happening for this
>>>>>>> document
>>>>>>> in Adobe Reader. Would be interesting to see what's being save after
>>>>>>> the
>>>>> forms has been filled out and saved using Acrobat. We'd need a test
>>>>>>> form to
>>>>>>> come up with an enhancement like this.
>>>>>>>> BR
>>>>>>>> Maruan
>>>>>>>>
>>>>>>>>
>>>>>>>>   What you should do: use NOTEPAD++ to look whether there's
>>>>>>>>> "/AcroForm"
>>>>>>>>>
>>>>>>>> in
>>>>>>> the "compressed" file.
>>>>>>>>> - if it is missing, tell the client (or your boss) just that
>>>>>>>>> - if it isn't missing, then there's some problem in PDFBox (try also
>>>>>>>>>
>>>>>>>> the
>>>>>>> loadNonSeq I mentioned earlier)
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>>   You can use qpdf , then use these options:
>>>>>>>>>> I will now try using this link to compress the original file.
>>>>>>>>>>
>>>>>>>>>> Another strategy to think about - can your client generate a
>>>>>>>>>> non-confidential file, so that you can share it, and the
>>>>>>>>>>
>>>>>>>>> "compressed"
>>>>> file?
>>>>>>>>> I wish I had direct communication with the clients but due to
>>>>>>>>> bureaucracy,
>>>>>>>>> I am having to go through multiple layers to get my message across
>>>>>>>>> to
>>>>> them.
>>>>>>>>> I will share more information as soon as I have them.
>>>>>>>>>> PS: i sent these image links to my personal email first to make
>>>>>>>>>> sure
>>>>>>>>>>
>>>>>>>>> that I
>>>>>>>>> can open them. I could and so I am hoping you all could too. If you
>>>>>>>>> are
>>>>>>> unable to open them, please let me know.
>>>>>>>>>> Thanks,
>>>>>>>>>> Balaji
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <
>>>>>>>>>>
>>>>>>>>> THausherr@t-online.de
>>>>>>> wrote:
>>>>>>>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <bv...@tibco.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for your pointers and sorry about the image. I am
>>>>>>>>>>>>> attaching it with this email.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The point I am trying to make is that the PDF, which was
>>>>>>>>>>>>> decompressed using WriteDecodedDoc, is smaller in size than the
>>>>>>>>>>>>> original PDF given to us by our customers.
>>>>>>>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox
>>>>>>>>>>>>> did not have any PDAcroform fields whereas the decompressed PDF
>>>>>>>>>>>>> given to us by the customers does contain Acroform fields. Hence
>>>>>>>>>>>>> I wanted to know how to properly decompress the PDF using pdfbox
>>>>>>>>>>>>> APIs. The reason why I was analyzing COSStream was to check if
>>>>>>>>>>>>> the decompression of the compressed PDF was happening correctly
>>>>>>>>>>>>> while using PDFBox APIs.
>>>>>>>>>>>>> I know it would have been difficult for you to help me without
>>>>>>>>>>>>> the actual PDFs. For that, I would like to thank you for your
>>>>>>>>>>>>> time and pointers.
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe it's worth trying to share the file "visually" with us. Open
>>>>>>>>>>>> both files (compressed and decompressed) with PDFDebugger [1] and
>>>>>>>>>>>> post a screenshot of both somewhere (dropbox etc.) and share the
>>>>>>>>>>>> link with us. Maybe that could shed some light on your issue.
>>>>>>>>>>>
>>>>>>>>>>> @Balaji: here's an example of what such a screenshot would look like:
>>>>>>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>>>>>>>
>>>>>>>>>>> Tilman
>>>>>>>>>>>
>>>>>>>>>>>> BR
>>>>>>>>>>>> Andreas Lehmkühler
>>>>>>>>>>>>
>>>>>>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <THausherr@t-online.de>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is all very confusing... /acroform is in the document
>>>>>>>>>>>>>> catalog. I don't see how the page content stream is related to
>>>>>>>>>>>>>> it. The best is that you either go through the source code, or
>>>>>>>>>>>>>> read the spec and then look at the pdf.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To find out what's going on, you'd have to start from that
>>>>>>>>>>>>>> /acroform entry and then compare the two files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is really difficult to help you without the files. The cause
>>>>>>>>>>>>>> could be a bug in pdfbox, or a malformed pdf...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some more ideas:
>>>>>>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>>>>>>>> - try the unreleased 2.0 version, that one has some improvements
>>>>>>>>>>>>>>   in the acroform stuff. Note that the API is different.
>>>>>>>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you still need help, one possibility would be 1) post the
>>>>>>>>>>>>>> smallest possible code that fails, and 2) post a small part of
>>>>>>>>>>>>>> the raw PDF, i.e. the objects relevant to the field in your code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tilman
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3
>>>>>>>>>>>>>>> pages), I tried getting the COSStream for each page:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> PDPage firstPage=(PDPage)
>>>>>>>>>>>>>>> document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>>>>>>>              pdStream=firstPage.getContents();
>>>>>>>>>>>>>>>              COSStream stream=pdStream.getStream();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In the above code snippet, the object stream, when analyzed in
>>>>>>>>>>>>>>> debug mode, has the following:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is :
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From this point on, using the COSStream object for every page,
>>>>>>>>>>>>>>> how can I decompress and find out the acroform fields given
>>>>>>>>>>>>>>> that the unFilteredStream object is null for COSStream?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>>>>>>>>>>>> <bvenkata@tibco.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      Thank you for your response Tilman.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      I had previously tried using the WriteDecodedDoc for my
>>>>>>>>>>>>>>>      compressed PDF and I tried to get the number of acro form
>>>>>>>>>>>>>>>      fields present in the output file generated by
>>>>>>>>>>>>>>>      WriteDecodedDoc. The API still could not find the acro
>>>>>>>>>>>>>>>      form fields in the generated decompressed file.
>>>>>>>>>>>>>>>      Also the decompressed file generated is 75 KB which is far
>>>>>>>>>>>>>>>      less than the original decompressed file which I have
>>>>>>>>>>>>>>>      (1.6 MB) though I could edit the acro form fields using
>>>>>>>>>>>>>>>      acrobat reader.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      Thanks,
>>>>>>>>>>>>>>>      Balaji
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>>>>>>>      <THausherr@t-online.de> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>          On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>              My question is: how do I flatedecode a PDF so that
>>>>>>>>>>>>>>>              I can find all the acroform fields within it. Any
>>>>>>>>>>>>>>>              help or pointers would be highly appreciated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>          You could try the WriteDecodedDoc option of the
>>>>>>>>>>>>>>>          command line app
>>>>>>>>>>>>>>>          https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>          Maybe you can have further ideas by comparing the two
>>>>>>>>>>>>>>>          files with NOTEPAD++.... however the two files might
>>>>>>>>>>>>>>>          have their objects in different order.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>          Tilman
>>>>>>>>>>>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by John Hewson <jo...@jahewson.com>.
> On 27 May 2015, at 11:02, Balaji Venkatamohan <bv...@tibco.com> wrote:
> 
> Thanks Tilman for letting the website developer know about the shortcomings
> of their compression technique.
> 
> The PDF owner did not share with us the information about which website
> they used for compressing the PDF. My teammates helped in identifying this
> website. I will let the customer know about this particular website and
> will leave it to them regarding continuing to use this website for their
> PDF documents.
> 
> Could you also answer the following question please?
> Would Pdfbox API change its code to accommodate the incorrect condition
> that annotation fields (editable fields) are outside acro form fields as
> well? I know the PDF compressed by the website is incorrect and hence I
> would understand if you don't go ahead with this.

It wouldn’t be too hard to write some code using PDFBox’s COS API which
can repair such PDFs. However, it wouldn’t be suitable for inclusion in
PDFBox itself, as the compressed files really don’t contain forms and it would
be incorrect if PDFBox could read them.
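
For illustration only, here is a minimal sketch of the selection step such a
repair would need. It is not PDFBox code: dictionaries are modelled as plain
Java maps, and the names WidgetFieldCollector, collectRootFields and rootOf
are made up for this example. It assumes the widget annotations still carry
their field entries (/T, and optionally /Parent), which this thread suggests
but does not confirm for every file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for PDF dictionaries: each one is a Map keyed by entry
// name ("Subtype", "T", "Parent"). This is NOT the PDFBox API.
class WidgetFieldCollector {

    // Upper bound on /Parent hops, so a malformed cyclic chain cannot loop forever.
    private static final int MAX_DEPTH = 64;

    // Follows /Parent links up to the topmost dictionary of a field hierarchy.
    @SuppressWarnings("unchecked")
    static Map<String, Object> rootOf(Map<String, Object> dict) {
        Map<String, Object> current = dict;
        for (int i = 0; i < MAX_DEPTH && current.get("Parent") instanceof Map; i++) {
            current = (Map<String, Object>) current.get("Parent");
        }
        return current;
    }

    // Returns the candidates for /AcroForm /Fields: for every widget annotation,
    // its topmost ancestor, kept only if it has a field name (/T) and listed once.
    static List<Map<String, Object>> collectRootFields(
            List<List<Map<String, Object>>> annotsPerPage) {
        List<Map<String, Object>> roots = new ArrayList<Map<String, Object>>();
        // Identity-based set: two distinct fields may have equal contents.
        Set<Map<String, Object>> seen = Collections.newSetFromMap(
                new IdentityHashMap<Map<String, Object>, Boolean>());
        for (List<Map<String, Object>> annots : annotsPerPage) {
            for (Map<String, Object> annot : annots) {
                if (!"Widget".equals(annot.get("Subtype"))) {
                    continue; // links, comments, etc. are not form fields
                }
                Map<String, Object> root = rootOf(annot);
                if (root.containsKey("T") && seen.add(root)) {
                    roots.add(root);
                }
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        // One page holding a text-field widget and an unrelated link annotation.
        Map<String, Object> name = new HashMap<String, Object>();
        name.put("Subtype", "Widget");
        name.put("T", "name");

        Map<String, Object> link = new HashMap<String, Object>();
        link.put("Subtype", "Link");

        List<Map<String, Object>> page = new ArrayList<Map<String, Object>>();
        page.add(name);
        page.add(link);
        List<List<Map<String, Object>>> pages =
                new ArrayList<List<Map<String, Object>>>();
        pages.add(page);

        System.out.println(collectRootFields(pages).size() + " field(s) found");
    }
}
```

With PDFBox, the same walk would presumably run over each page's /Annots
array, and the collected dictionaries would be written to the /Fields array
of the catalog's /AcroForm dictionary before saving.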

— John

> Thanks,
> Balaji
> 
> 
> On Tue, May 26, 2015 at 10:45 PM, Tilman Hausherr <TH...@t-online.de>
> wrote:
> 
>> I just tested it. It also removes /Outlines and /Metadata and more
>> important data from PDF files.
>> 
>> So your client can't share the PDF with us, but he shared it with some
>> website.
>> 
>> A little research shows that this website is owned by Lauri Lehtinen from
>> Tallinn, Estonia.
>> http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=pdfcompress.com
>> https://www.linkedin.com/in/laurilehtinen
>> https://twitter.com/laurii
>> 
>> I also tweeted him.
>> 
>> Tilman
>> 
>> 
>> Am 27.05.2015 um 03:06 schrieb Balaji Venkatamohan:
>> 
>>> Okay, I found out the online tool used by the customer to compress their
>>> PDF.
>>> 
>>> It is : https://www.pdfcompress.com/
>>> 
>>> I don't need to rely on the PDF sent by the customer because all PDFs that
>>> are available on the web, are compressed in the same manner by this tool,
>>> that is, it gets rid of all acro form fields during compression.
>>> 
>>> For example, the f941 govt form available at this site:
>>> http://www.irs.gov/pub/irs-pdf/f941.pdf
>>> If we compress this using the online tool, the resultant file size is very
>>> low, which is good. However, there are no acro form fields in the
>>> compressed PDF.
>>> 
>>> Thanks,
>>> Balaji
>>> 
>>> 
>>> 
>>> On Sun, May 24, 2015 at 2:38 AM, Maruan Sahyoun <sa...@fileaffairs.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>> On 23.05.2015 at 16:37, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>> So AcroForms/Fields is an empty Array?
>>>>>
>>>>> Yes, in the filled interview_compressed.pdf, the acroform fields are
>>>>> not null but empty. The size of the array is zero.
>>>>>
>>>>> Also, I tried the qpdf command line tool to compress the file
>>>>> interview.pdf, and the resulting compressed file size of 1.6 MB was
>>>>> nowhere near the file size of interview_compressed.pdf (21 KB).
>>>>
>>>> Do you think it's possible to get a similar PDF file, or permission to
>>>> use it internally, so that we have a sample to look at for a potential
>>>> fix?
>>>>
>>>> Although the PDF is not in line with the spec, Acrobat is able to handle
>>>> it, so we could look into getting a similar result.
>>>>
>>>> BR
>>>> Maruan
>>>>
>>>>> Thanks,
>>>>> Balaji
>>>>>
>>>>> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> On 22.05.2015 at 23:00, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
>>>>>>>
>>>>>>> I opened interview_compressed.pdf in notepad++ and did not see any
>>>>>>> 'Acroform' text anywhere.
>>>>>>> However, as Maruan suggested, I entered some data into what looks like
>>>>>>> form fields of interview_compressed.pdf and saved it. When I opened
>>>>>>> this file in notepad++, I did see 'Acroform' text in it. I also
>>>>>>> noticed an increase in file size from 21 KB to ~530 KB.
>>>>>>>
>>>>>>> I then ran this filled, saved, compressed PDF in pdfdebugger.java and
>>>>>>> saw that the field values were getting stored, not under Acroform
>>>>>>> fields but under Annotations.
>>>>>>
>>>>>> So AcroForms/Fields is an empty Array?
>>>>>>
>>>>>>> Please refer to this image:
>>>>>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>>>>>>
>>>>>>> So, whatever the compression technique was, it simply made all the
>>>>>>> Acroform fields disappear from the original PDF but retained all
>>>>>>> annotations, which also contain the interactive forms, and this helped
>>>>>>> reduce the file size so much? If this is the case, can the PDFBox API
>>>>>>> also use a similar compression technique to compress such a huge file
>>>>>>> into a smaller one?
>>>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.
Thanks Tilman for letting the website developer know about the shortcomings
of their compression technique.

The PDF owner did not share with us the information about which website
they used for compressing the PDF. My teammates helped in identifying this
website. I will let the customer know about this particular website and
will leave it to them regarding continuing to use this website for their
PDF documents.

Could you also answer the following question please?
Would PDFBox change its code to also accommodate the incorrect case where
the annotation fields (editable fields) sit outside the AcroForm fields?
I know the PDF compressed by the website is incorrect, and hence I would
understand if you don't go ahead with this.

Thanks,
Balaji
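
A side note on the subject line of this thread: /FlateDecode is the lossless
zlib/deflate method, so inflating a stream gives back exactly the bytes that
were deflated. It cannot by itself drop form fields, which suggests the website
removed objects rather than merely compressing them. A minimal sketch using only
java.util.zip; the class FlateRoundTrip and its methods are hypothetical names
for illustration, not PDFBox API:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper (not part of PDFBox): round-trips bytes through
// zlib/deflate, the compression method behind the PDF /FlateDecode filter.
final class FlateRoundTrip {

    // Compress raw bytes in the zlib format used for /FlateDecode stream bodies.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress the bytes that sit between "stream" and "endstream".
    static byte[] inflate(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and not finished: input is truncated or needs a preset dictionary.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A tiny page content stream used as sample data.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] restored = inflate(deflate(content));
        System.out.println(Arrays.equals(content, restored)); // prints "true"
    }
}
```

Page content streams decoded this way hold drawing operators only; the form
fields live in dictionaries reachable from the document catalog, so decoding
page streams does not reveal them.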


On Tue, May 26, 2015 at 10:45 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> I just tested it. It also removes /Outlines, /Metadata and other
> important data from PDF files.
>
> So your client can't share the PDF with us, but he shared it with some website.
>
> A little research shows that this website is owned by Lauri Lehtinen from
> Tallinn, Estonia.
> http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=pdfcompress.com
> https://www.linkedin.com/in/laurilehtinen
> https://twitter.com/laurii
>
> I also tweeted him.
>
> Tilman
>

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
I just tested it. It also removes /Outlines, /Metadata and other
important data from PDF files.

So your client can't share the PDF with us, but he shared it with some website.

A little research shows that this website is owned by Lauri Lehtinen
from Tallinn, Estonia.
http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=pdfcompress.com
https://www.linkedin.com/in/laurilehtinen
https://twitter.com/laurii

I also tweeted him.

Tilman
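
The Notepad++ check used earlier in this thread (looking for "/AcroForm" in the
raw file) can also be scripted, and extended to the other keys mentioned above.
A minimal sketch, assuming the objects of interest are stored uncompressed: a
catalog kept inside a compressed object stream (PDF 1.5 and later) would not be
visible to a raw byte scan. PdfNameScan and countName are hypothetical names
for illustration, not PDFBox API:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of PDFBox): counts occurrences of a PDF name
// token such as "/AcroForm" in the raw bytes of a file.
final class PdfNameScan {

    // True if b terminates a name token: PDF whitespace or a delimiter character.
    static boolean endsName(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32
                || "()<>[]{}/%".indexOf(b) >= 0;
    }

    // Count exact occurrences of the name, so "/AcroForm" does not match "/AcroFormX".
    static int countName(byte[] pdf, String name) {
        byte[] needle = name.getBytes(StandardCharsets.ISO_8859_1);
        int count = 0;
        for (int i = 0; i + needle.length <= pdf.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (pdf[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            int end = i + needle.length;
            if (match && (end == pdf.length || endsName(pdf[end] & 0xFF))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample objects; a real run would read the file's bytes instead.
        String sample = "1 0 obj\n<</Type/Catalog/Pages 2 0 R/AcroForm 5 0 R>>\nendobj\n"
                + "7 0 obj\n<</Type/Annot/Subtype/Widget/T(name)>>\nendobj\n";
        byte[] bytes = sample.getBytes(StandardCharsets.ISO_8859_1);
        for (String key : new String[] {"/AcroForm", "/Outlines", "/Metadata", "/Widget"}) {
            System.out.println(key + ": " + countName(bytes, key));
        }
    }
}
```

To scan a real file, pass Files.readAllBytes(Paths.get(path)) to countName
instead of the sample bytes. A file with "/Widget" hits but no "/AcroForm" hit
matches the situation described in this thread.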

On 27.05.2015 at 03:06, Balaji Venkatamohan wrote:
> Okay, I found out the online tool used by the customer to compress their
> PDF.
>
> It is : https://www.pdfcompress.com/
>
> I don't need to rely on the PDF sent by the customer because all PDFs that
> are available on the web, are compressed in the same manner by this tool,
> that is, it gets rid of all acro form fields during compression.
>
> For example, the f941 govt form available at this site:
> http://www.irs.gov/pub/irs-pdf/f941.pdf
> If we compress this using the online tool, the resultant file size is very
> low, which is good. However, there are no acro form fields in the
> compressed PDF.
>
> Thanks,
> Balaji
>
>
>
> On Sun, May 24, 2015 at 2:38 AM, Maruan Sahyoun <sa...@fileaffairs.de>
> wrote:
>
>> Hi,
>>
>>> On 23.05.2015 at 16:37, Balaji Venkatamohan <bv...@tibco.com> wrote:
>>>
>>> Hi,
>>>
>>> So AcroForms/Fields is an empty Array?
>>>
>>> Yes, in the filled interview_compressed.pdf, the acroforms are not null
>>> but empty. Size of array is zero.
>>>
>>> Also, I tried qpdf command line tool to compress the file interview.pdf
>>> and the resultant compressed file size of 1.6MB was nowhere near the file
>>> size of interview_compressed.pdf (21 KB).
>>
>> Do you think it's possible to get a similar PDF file, or permission to
>> use it internally, so we have a sample to look at for a potential fix?
>>
>> Although the PDF is not in line with the spec, Acrobat is able to handle
>> it, so we could look into getting a similar result.
>>
>> BR
>> Maruan
>>
>>
>>> Thanks,
>>> Balaji
>>>
>>> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>> On 22.05.2015 at 23:00, Balaji Venkatamohan <bvenkata@tibco.com> wrote:
>>>>>
>>>>> I opened the interview_compressed in notepad++ and did not see any
>>>>> 'Acroform' text anywhere.
>>>>> However, as Maruan suggested, I entered some data into what looks like
>>>>> form fields of interview_compressed.pdf and saved it. When I opened this
>>>>> file in notepad++, I did see 'Acroform' text in it. I also noticed an
>>>>> increase in file size from 21 KB to ~530 KB.
>>>>>
>>>>> I then ran this filled saved compressed PDF in pdfdebugger.java and saw
>>>>> that the field values were getting stored, not under Acroform fields
>>>>> but under Annotations.
>>>>
>>>> So AcroForms/Fields is an empty Array?
>>>>
>>>>> Please refer to this image:
>>>>>
>>>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>>>>
>>>>> So, whatever the compression technique was, it simply made all the
>>>>> Acroform fields disappear from the original PDF but retained all
>>>>> annotations, which also contain the interactive forms, and this helped
>>>>> reduce the file size so much? If this is the case, can pdfbox API also
>>>>> use a similar compression technique to compress such a huge file into
>>>>> a smaller one?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> On 22.05.2015 at 21:54, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>>>>
>>>>>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I used PdfDebugger to make the internal PDF structure of the two
>>>>>>>> files (1) interview.pdf and (2) interview_compressed.pdf visually
>>>>>>>> available and I have uploaded my images to imageshack. Here are the
>>>>>>>> five links:
>>>>>>>>
>>>>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>>>>>
>>>>>>>> The first two links are from the internal structure of interview.pdf
>>>>>>>> (original uncompressed file)
>>>>>>>> The third and fourth links are from the internal structure of
>>>>>>>> interview_compressed.pdf (compressed file)
>>>>>>>> The fifth link compares the file sizes of the two files and as you
>>>>>>>> can also see, the difference is huge.
>>>>>>>>
>>>>>>>> As you might notice, the file interview_compressed.pdf has no acroform
>>>>>>>
>>>>>>> Indeed... but this is needed - from the spec:
>>>>>>>
>>>>>>> "The contents and properties of a document’s interactive form shall be
>>>>>>> defined by an interactive form dictionary that shall be referenced from
>>>>>>> the AcroForm entry in the document catalogue (see 7.7.2, “Document
>>>>>>> Catalog”). Table 218 shows the contents of this dictionary."
>>>>>>
>>>>>> correct
>>>>>>
>>>>>>>> fields listed even though opening the PDF in pdf reader allows me to
>>>>>>>> enter values in places which look like AcroForm fields and also save
>>>>>>>> them. Are there any other PDF 'types' similar to Acroform fields which
>>>>>>>> would enable users to fill data and which can be accessed in PdfBox
>>>>>>>> APIs without having to go through PDAcrofield?
>>>>>>>
>>>>>>> Yes, annotations... there are some common parts, but this is just a
>>>>>>> vague observation from me, I'm not the acroform specialist.
>>>>>>
>>>>>> At first glance it looks like all the entries necessary to (re-)generate
>>>>>> the form fields are there. That's what's likely happening for this
>>>>>> document in Adobe Reader. It would be interesting to see what's being
>>>>>> saved after the form has been filled out and saved using Acrobat. We'd
>>>>>> need a test form to come up with an enhancement like this.
>>>>>>
>>>>>> BR
>>>>>> Maruan
>>>>>>
>>>>>>> What you should do: use NOTEPAD++ to look whether there's "/AcroForm"
>>>>>>> in the "compressed" file.
>>>>>>> - if it is missing, tell the client (or your boss) just that
>>>>>>> - if it isn't missing, then there's some problem in PDFBox (try also
>>>>>>>   the loadNonSeq I mentioned earlier)
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>>> You can use qpdf , then use these options:
>>>>>>>>
>>>>>>>> I will now try using this link to compress the original file.
>>>>>>>>
>>>>>>>> Another strategy to think about - can your client generate a
>>>>>>>> non-confidential file, so that you can share it, and the "compressed"
>>>>>>>> file?
>>>>>>>> I wish I had direct communication with the clients but due to
>>>>>>>> bureaucracy, I am having to go through multiple layers to get my
>>>>>>>> message across to them.
>>>>>>>> I will share more information as soon as I have it.
>>>>>>>>
>>>>>>>> PS: I sent these image links to my personal email first to make sure
>>>>>>>> that I can open them. I could and so I am hoping you all could too.
>>>>>>>> If you are unable to open them, please let me know.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Balaji
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <THausherr@t-online.de>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <bv...@tibco.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your pointers and sorry about the image. I am
>>>>>>>>>>> attaching it with this email.
>>>>>>>>>>>
>>>>>>>>>>> The point I am trying to make is that the PDF, which was
>>>>>>>>>>> decompressed using WriteDecodedDoc, is smaller in size than the
>>>>>>>>>>> original PDF given to us by our customers.
>>>>>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox
>>>>>>>>>>> did not have any PDAcroform fields whereas the decompressed PDF
>>>>>>>>>>> given to us by the customers does contain Acroform fields. Hence I
>>>>>>>>>>> wanted to know how to properly decompress the PDF using pdfbox
>>>>>>>>>>> APIs. The reason why I was analyzing COSStream was to check if the
>>>>>>>>>>> decompression of the compressed PDF was happening correctly while
>>>>>>>>>>> using PDFBox APIs.
>>>>>>>>>>> I know it would have been difficult for you to help me without the
>>>>>>>>>>> actual PDFs. For that, I would like to thank you for your time and
>>>>>>>>>>> pointers.
>>>>>>>>>>
>>>>>>>>>> Maybe it's worth trying to share the file "visually" with us. Open
>>>>>>>>>> both files (compressed and decompressed) with PDFDebugger [1] and
>>>>>>>>>> post a screenshot of both somewhere (dropbox etc.) and share the
>>>>>>>>>> link with us. Maybe that could shed some light on your issue.
>>>>>>>>>
>>>>>>>>> @Balaji: here's an example of what such a screenshot would look like:
>>>>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>>> BR
>>>>>>>>>> Andreas Lehmkühler
>>>>>>>>>>
>>>>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>>>>>
>>>>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <THausherr@t-online.de>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>>>>>
>>>>>>>>>>>> This is all very confusing... /AcroForm is in the document catalog. I
>>>>>>>>>>>> don't see how the page content stream is related to it. The best is that
>>>>>>>>>>>> you either go through the source code, or read the spec and then look at
>>>>>>>>>>>> the pdf.
>>>>>>>>>>>>
>>>>>>>>>>>> To find out what's going on, you'd have to start from that /AcroForm
>>>>>>>>>>>> entry and then compare the two files.
>>>>>>>>>>>>
>>>>>>>>>>>> It is really difficult to help you without the files. The cause could
>>>>>>>>>>>> be a bug in pdfbox, or a malformed pdf...
>>>>>>>>>>>>
>>>>>>>>>>>> Some more ideas:
>>>>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in the
>>>>>>>>>>>>   acroform stuff. Note that the API is different.
>>>>>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>>>>>
>>>>>>>>>>>> If you still need help, one possibility would be 1) post the smallest
>>>>>>>>>>>> possible code that fails, and 2) post a small part of the raw PDF, i.e.
>>>>>>>>>>>> the objects relevant to the field in your code.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Tilman
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Am 19.05.2015 um 23:03 schrieb Balaji Venkatamohan:
>>>>>>>>>>>>
>>>>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3
>>>>>>>>>>>>> pages), I tried getting the COSStream for each page:
>>>>>>>>>>>>>
>>>>>>>>>>>>> PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>>>>> PDStream pdStream = firstPage.getContents();
>>>>>>>>>>>>> COSStream stream = pdStream.getStream();
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the above code snippet, the object stream, when analyzed in
>>>>>>>>>>>>> debug mode, has the following:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>>>>>>
>>>>>>>>>>>>> From this point on, using the COSStream object for every page,
>>>>>>>>>>>>> how can I decompress and find the AcroForm fields, given that
>>>>>>>>>>>>> the unFilteredStream object is null for COSStream?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <
>>>>>>>>>>>>> bvenkata@tibco.com
>>>>>>>>>>>>> <ma...@tibco.com>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Thank you for your response Tilman.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     I had previously tried using WriteDecodedDoc on my
>>>>>>>>>>>>>     compressed PDF and tried to get the number of AcroForm
>>>>>>>>>>>>>     fields present in the output file generated by
>>>>>>>>>>>>>     WriteDecodedDoc. The API still could not find the AcroForm
>>>>>>>>>>>>>     fields in the generated decompressed file. Also, the
>>>>>>>>>>>>>     decompressed file generated is 75 KB, which is far less than
>>>>>>>>>>>>>     the original decompressed file which I have (1.6 MB), though
>>>>>>>>>>>>>     I could edit the AcroForm fields using Acrobat Reader.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Thanks,
>>>>>>>>>>>>>     Balaji
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>>>>>     <THausherr@t-online.de <ma...@t-online.de>>
>>>> wrote:
>>>>>>>>>>>>>         Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
>>>>>>>>>>>>>
>>>>>>>>>>>>>             My question is: how do I flatedecode a PDF so that
>>>>>>>>>>>>>             I can find all the acroform fields within it? Any
>>>>>>>>>>>>>             help or pointers would be highly appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>         You could try the WriteDecodedDoc option of the command
>>>>>>>>>>>>>         line app:
>>>>>>>>>>>>>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>>>>>
>>>>>>>>>>>>>         Maybe you can get further ideas by comparing the two
>>>>>>>>>>>>>         files with NOTEPAD++.... however the two files might
>>>>>>>>>>>>>         have their objects in different order.
>>>>>>>>>>>>>
>>>>>>>>>>>>>         Tilman
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.
Okay, I found the online tool the customer used to compress their PDF.

It is: https://www.pdfcompress.com/

I no longer need to rely on the PDF sent by the customer, because any
publicly available PDF is compressed in the same manner by this tool, that
is, it gets rid of all AcroForm fields during compression.
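
As background: FlateDecode itself is lossless zlib compression, so decoding
a stream gives back exactly the bytes that were encoded and cannot add or
remove form fields. A minimal sketch using only java.util.zip follows. The
class name FlateRoundTrip and the sample content-stream text are made up for
illustration, and predictors (/DecodeParms) are not handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Illustration only: round trip of a stream body through zlib, the format behind /FlateDecode. */
final class FlateRoundTrip {

    /** Compresses bytes into the zlib format used for /FlateDecode stream bodies. */
    public static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses a zlib-wrapped stream body; rejects truncated or invalid data. */
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and no more input means the data ended too early.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up page content: drawing operators only, no form field data.
        byte[] content = "BT /F1 12 Tf 72 720 Td (Name:) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.ISO_8859_1));
    }
}
```

So if the fields are gone after "compression", the tool must have rewritten
the document structure; the Flate filter alone would not explain it.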

For example, take the f941 government form available at:
http://www.irs.gov/pub/irs-pdf/f941.pdf
If we compress it using the online tool, the resulting file is very small,
which is good. However, there are no AcroForm fields in the compressed PDF.
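
A quick way to repeat this check on any file (the same thing as searching
in Notepad++) is to scan the raw bytes for the /AcroForm name. This is only
a rough sketch: the class name RawAcroFormCheck is made up, and the scan can
miss the entry when the catalog is stored inside a compressed object stream
(PDF 1.5 and later), so a negative result is a hint rather than proof.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Illustration only: looks for PDF name tokens in the raw, undecoded file bytes. */
final class RawAcroFormCheck {

    /** Returns true if the raw bytes contain the given token, e.g. "/AcroForm". */
    public static boolean containsToken(byte[] pdfBytes, String token) {
        // ISO-8859-1 maps every byte to exactly one char, so binary stream
        // data in the file cannot break the conversion to a String.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains(token);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF to inspect.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("/AcroForm present: " + containsToken(bytes, "/AcroForm"));
        System.out.println("/Widget present:   " + containsToken(bytes, "/Widget"));
    }
}
```

Running it on the original and on the compressed file side by side shows
whether the /AcroForm entry survived in plain sight.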

Thanks,
Balaji



On Sun, May 24, 2015 at 2:38 AM, Maruan Sahyoun <sa...@fileaffairs.de>
wrote:

> Hi,
>
> > Am 23.05.2015 um 16:37 schrieb Balaji Venkatamohan <bv...@tibco.com>:
> >
> > Hi,
> >
> > So AcroForms/Fields is an empty Array?
> >
> > Yes, in the filled interview_compressed.pdf, the AcroForm fields array
> > is not null but empty. The size of the array is zero.
> >
> > Also, I tried the qpdf command line tool to compress interview.pdf, and
> > the resulting file size of 1.6 MB was nowhere near the file size of
> > interview_compressed.pdf (21 KB).
>
> Do you think it would be possible to get a similar PDF file, or permission
> to use this one internally, so that we have a sample to look at for a
> potential fix?
>
> Although the PDF is not in line with the spec, Acrobat is able to handle
> it, so we could look into getting a similar result.
>
> BR
> Maruan
>
>

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

> Am 23.05.2015 um 16:37 schrieb Balaji Venkatamohan <bv...@tibco.com>:
> 
> Hi,
> 
> So AcroForms/Fields is an empty Array?
> 
> Yes, in the filled interview_compressed.pdf, the AcroForm fields array is
> not null but empty. The size of the array is zero.
> 
> Also, I tried the qpdf command line tool to compress interview.pdf, and
> the resulting file size of 1.6 MB was nowhere near the file size of
> interview_compressed.pdf (21 KB).

Do you think it would be possible to get a similar PDF file, or permission to use this one internally, so that we have a sample to look at for a potential fix?

Although the PDF is not in line with the spec, Acrobat is able to handle it, so we could look into getting a similar result.
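
To make the idea of "a similar result" concrete, here is a rough sketch of how a field list could be rebuilt from the widget annotations found on the pages. It relies on the fact that the PDF spec allows a field dictionary and its single widget annotation to be merged into one dictionary, so a widget may carry /T (name) and /V (value) itself. Everything here is a toy model in plain Java: a dictionary is just a Map, the class FieldRecoverySketch and its methods are hypothetical, none of it is the PDFBox API, and widgets that only reach their field through /Parent are ignored. Whether Acrobat really works this way is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: a PDF dictionary is a Map<String, Object>. This is not the PDFBox API. */
final class FieldRecoverySketch {

    /** Returns the widget annotations that carry a field name (/T), each listed once. */
    public static List<Map<String, Object>> collectFieldCandidates(
            List<List<Map<String, Object>>> annotsPerPage) {
        List<Map<String, Object>> fields = new ArrayList<Map<String, Object>>();
        for (List<Map<String, Object>> annots : annotsPerPage) {
            for (Map<String, Object> annot : annots) {
                boolean isWidget = "Widget".equals(annot.get("Subtype"));
                boolean hasFieldName = annot.get("T") != null;
                if (isWidget && hasFieldName && !containsSameObject(fields, annot)) {
                    fields.add(annot);
                }
            }
        }
        return fields;
    }

    /** Identity check, because one dictionary object may be referenced from several places. */
    private static boolean containsSameObject(List<Map<String, Object>> list,
                                              Map<String, Object> item) {
        for (Map<String, Object> candidate : list) {
            if (candidate == item) {
                return true;
            }
        }
        return false;
    }

    /** Builds a model annotation dictionary; name and value may be null. */
    public static Map<String, Object> annotation(String subtype, String name, String value) {
        Map<String, Object> dict = new HashMap<String, Object>();
        dict.put("Subtype", subtype);
        if (name != null) {
            dict.put("T", name);
        }
        if (value != null) {
            dict.put("V", value);
        }
        return dict;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> page1 = new ArrayList<Map<String, Object>>();
        page1.add(annotation("Widget", "applicantName", "Jane Doe")); // made-up field
        page1.add(annotation("Link", null, null));                    // not a form field
        List<List<Map<String, Object>>> pages = new ArrayList<List<Map<String, Object>>>();
        pages.add(page1);
        for (Map<String, Object> field : collectFieldCandidates(pages)) {
            System.out.println(field.get("T") + " = " + field.get("V"));
        }
    }
}
```

In a real document the per-page lists would come from each page's /Annots array, and the collected dictionaries would become the entries of AcroForm /Fields.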

BR
Maruan


> 
> Thanks,
> Balaji
> 
> On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <sa...@fileaffairs.de>
> wrote:
> 
>> Hi,
>> 
>>> Am 22.05.2015 um 23:00 schrieb Balaji Venkatamohan <bv...@tibco.com>:
>>> 
>>> I opened interview_compressed.pdf in Notepad++ and did not see any
>>> 'AcroForm' text anywhere.
>>> However, as Maruan suggested, I entered some data into what look like
>>> form fields of interview_compressed.pdf and saved it. When I opened this
>>> file in Notepad++, I did see 'AcroForm' text in it. I also noticed an
>>> increase in file size from 21 KB to ~530 KB.
>>> 
>>> I then opened this filled and saved compressed PDF in PDFDebugger and saw
>>> that the field values were being stored, not under the AcroForm fields
>>> but under Annotations.
>> 
>> 
>> 
>> So AcroForms/Fields is an empty Array?
>> 
>>> Please refer to this image:
>>> 
>>> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>>> 
>>> So, whatever the compression technique was, it simply made all the
>>> AcroForm fields disappear from the original PDF but retained all the
>>> annotations, which also contain the interactive forms, and this helped
>>> reduce the file size so much? If this is the case, can the PDFBox API
>>> also use a similar compression technique to compress such a huge file
>>> into a smaller one?
>>> 
>>> 
>>> 
>>> 
>>> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <sa...@fileaffairs.de>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>>> Am 22.05.2015 um 21:54 schrieb Tilman Hausherr <THausherr@t-online.de
>>> :
>>>>> 
>>>>> Am 22.05.2015 um 17:53 schrieb Balaji Venkatamohan:
>>>>>> Hello,
>>>>>> 
>>>>>> I used PDFDebugger to make the internal PDF structure of the two files,
>>>>>> (1) interview.pdf and (2) interview_compressed.pdf, visually available,
>>>>>> and I have uploaded my images to imageshack. Here are the five links:
>>>>>> 
>>>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>>> 
>>>>>> The first two links are from the internal structure of interview.pdf
>>>>>> (original uncompressed file)
>>>>>> The third and fourth links are from the internal structure of
>>>>>> interview_compressed.pdf (compressed file)
>>>>>> The fifth link compares the file sizes of the two files and, as you can
>>>>>> also see, the difference is huge.
>>>>>> 
>>>>>> As you might notice, the file interview_compressed.pdf has no acroform
>>>>> 
>>>>> Indeed... but this is needed - from the spec:
>>>>> 
>>>>> "The contents and properties of a document’s interactive form shall be
>>>>> defined by an interactive form dictionary that shall be referenced from
>>>>> the AcroForm entry in the document catalogue (see 7.7.2, “Document
>>>>> Catalog”). Table 218 shows the contents of this dictionary."
>>>>> 
>>>> 
>>>> correct
>>>> 
>>>>>> fields listed, even though opening the PDF in a PDF reader allows me to
>>>>>> enter values in places which look like AcroForm fields and also save
>>>>>> them. Are there any other PDF 'types' similar to AcroForm fields which
>>>>>> would enable users to fill in data and which can be accessed through
>>>>>> the PDFBox API without having to go through PDAcroForm?
>>>>> 
>>>>> Yes, annotations... there are some common parts, but this is just a
>>>>> vague observation from me, I'm not the acroform specialist.
>>>> 
>>>> At first glance it looks like all the entries necessary to (re-)generate
>>>> the form fields are there. That's likely what is happening for this
>>>> document in Adobe Reader. It would be interesting to see what is saved
>>>> after the form has been filled out and saved using Acrobat. We'd need a
>>>> test form to come up with an enhancement like this.
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>> 
>>>>> 
>>>>> What you should do: use NOTEPAD++ to look whether there's "/AcroForm"
>>>>> in the "compressed" file.
>>>>> - if it is missing, tell the client (or your boss) just that
>>>>> - if it isn't missing, then there's some problem in PDFBox (try also
>>>>>   the loadNonSeq I mentioned earlier)
>>>>> 
>>>>> Tilman
>>>>> 
>>>>>> 
>>>>>> You can use qpdf , then use these options:
>>>>>> 
>>>>>> I will now try using this link to compress the original file.
>>>>>> 
>>>>>> Another strategy to think about - can your client generate a
>>>>>> non-confidential file, so that you can share it, and the "compressed"
>>>> file?
>>>>>> 
>>>>>> I wish I had direct communication with the clients but, due to
>>>>>> bureaucracy, I have to go through multiple layers to get my message
>>>>>> across to them. I will share more information as soon as I have it.
>>>>>> 
>>>>>> PS: I sent these image links to my personal email first to make sure
>>>>>> that I can open them. I could, so I am hoping you all can too. If you
>>>>>> are unable to open them, please let me know.
>>>>>> 
>>>>>> Thanks,
>>>>>> Balaji
>>>>>> 
>>>>>> 
>>>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <
>> THausherr@t-online.de
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Am 22.05.2015 um 08:28 schrieb Andreas Lehmkühler:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Balaji Venkatamohan <bv...@tibco.com> hat am 20. Mai 2015 um
>>>> 03:24
>>>>>>>>> geschrieben:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thank you for your pointers and sorry about the image. I am
>>>>>>>>> attaching it with this email.
>>>>>>>>> 
>>>>>>>>> The point I am trying to make is that the PDF, which was decompressed
>>>>>>>>> using WriteDecodedDoc, is smaller in size than the original PDF given
>>>>>>>>> to us by our customers.
>>>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>>>>>>>>> not have any PDAcroForm fields, whereas the decompressed PDF given to
>>>>>>>>> us by the customers does contain AcroForm fields. Hence I wanted to
>>>>>>>>> know how to properly decompress the PDF using the PDFBox APIs. The
>>>>>>>>> reason I was analyzing COSStream was to check whether the
>>>>>>>>> decompression of the compressed PDF was happening correctly while
>>>>>>>>> using the PDFBox APIs.
>>>>>>>>> I know it would have been difficult for you to help me without the
>>>>>>>>> actual PDFs. For that, I would like to thank you for your time and
>>>>>>>>> pointers.
>>>>>>>>> 
>>>>>>>> Maybe it's worth trying to share the file "visually" with us. Open
>>>>>>>> both files (compressed and decompressed) with PDFDebugger [1] and post
>>>>>>>> a screenshot of both somewhere (Dropbox etc.) and share the link with
>>>>>>>> us. Maybe that could shed some light on your issue.
>>>>>>>> 
>>>>>>> @Balaji: here's an example of what such a screenshot would look like:
>>>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>>> 
>>>>>>> Tilman
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> BR
>>>>>>>> Andreas Lehmkühler
>>>>>>>> 
>>>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>>> 
>>>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <
>>>> THausherr@t-online.de>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>>> 
>>>>>>>>>> This is all very confusing... /AcroForm is in the document catalog.
>>>>>>>>>> I don't see how the page content stream is related to it. The best
>>>>>>>>>> is that you either go through the source code, or read the spec and
>>>>>>>>>> then look at the PDF.
>>>>>>>>>> 
>>>>>>>>>> To find out what's going on, you'd have to start from that /AcroForm
>>>>>>>>>> entry and then compare the two files.
>>>>>>>>>> 
>>>>>>>>>> It is really difficult to help you without the files. The cause
>>>>>>>>>> could be a bug in PDFBox, or a malformed PDF...
>>>>>>>>>> 
>>>>>>>>>> Some more ideas:
>>>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in
>>>>>>>>>>   the acroform stuff. Note that the API is different.
>>>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>>> 
>>>>>>>>>> If you still need help, one possibility would be 1) post the
>>>>>>>>>> smallest possible code that fails, and 2) post a small part of the
>>>>>>>>>> raw PDF, i.e. the objects relevant to the field in your code.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Tilman
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Am 19.05.2015 um 23:03 schrieb Balaji Venkatamohan:
>>>>>>>>>> 
>>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3
>>>> pages), I
>>>>>>>>>>> tried getting the COSStream for each of the page :
>>>>>>>>>>> 
>>>>>>>>>>> PDPage firstPage=(PDPage)
>>>>>>>>>>> document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>>>            pdStream=firstPage.getContents();
>>>>>>>>>>>            COSStream stream=pdStream.getStream();
>>>>>>>>>>> 
>>>>>>>>>>> In the above code snippet, the object stream, when analyzed in
>>>> debug
>>>>>>>>>>> mode, has the following:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is :
>>>>>>>>>>> 
>>>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>>>> 
>>>>>>>>>>> From this point on, using the COSStream object for every page,
>> how
>>>>>>>>>>> can I
>>>>>>>>>>> decompress and find out the acroform fields given that the
>>>>>>>>>>> unFilteredStream
>>>>>>>>>>> object is null for COSStream?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <
>>>>>>>>>>> bvenkata@tibco.com
>>>>>>>>>>> <ma...@tibco.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>    Thank you for your response Tilman.
>>>>>>>>>>> 
>>>>>>>>>>>    I had previously tried using the WriteDecodedDoc for my
>>>> compressed
>>>>>>>>>>>    PDF and I tried to get the number of acro form fields present
>>>> in
>>>>>>>>>>> the output file generated by WriteDecodedDoc. The API still
>> could
>>>>>>>>>>>    not find the acro form fields in the generated decompressed
>>>> file.
>>>>>>>>>>>     Also the decompressed file generated is 75 KB which is far
>>>> less
>>>>>>>>>>>    than the original decompressed file which I have (1.6 MB)
>>>> though I
>>>>>>>>>>>    could edit the acro form fields using acrobat reader.
>>>>>>>>>>> 
>>>>>>>>>>>    Thanks,
>>>>>>>>>>>    Balaji
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>    On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>>>    <THausherr@t-online.de <ma...@t-online.de>>
>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>        On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>>>> 
>>>>>>>>>>>            My question is: how do I flatedecode a PDF so that I
>>>> can
>>>>>>>>>>>            find all the
>>>>>>>>>>>            acroform fields within it. ANy help or pointers would
>>>> be
>>>>>>>>>>>            highly appreciated.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>        You could try the WriteDecodedDoc option of the command
>>>> line
>>>>>>>>>>> app
>>>>>>>>>>> 
>>>> https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>>> 
>>>>>>>>>>>        Maybe you can have further ideas by comparing the two
>>>> files
>>>>>>>>>>>        with NOTEPAD++.... however the two files might have their
>>>>>>>>>>>        objects in different order.
>>>>>>>>>>> 
>>>>>>>>>>>        Tilman
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>> ---------------------------------------------------------------------
>>>>>>>>>>>        To unsubscribe, e-mail:
>>>> users-unsubscribe@pdfbox.apache.org
>>>>>>>>>>>        <ma...@pdfbox.apache.org>
>>>>>>>>>>>        For additional commands, e-mail:
>>>> users-help@pdfbox.apache.org
>>>>>>>>>>>        <ma...@pdfbox.apache.org>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>>> 
>>>>>>>> 
>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>> 
>>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org <mailto:
>>>> users-unsubscribe@pdfbox.apache.org>
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org <mailto:
>>>> users-help@pdfbox.apache.org>
>>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.
Hi,

> So AcroForms/Fields is an empty Array?

Yes. In the filled-in interview_compressed.pdf, the AcroForm is not null, but
its fields array is empty (size zero).

I also tried the qpdf command-line tool to compress interview.pdf, and the
resulting file (1.6 MB) was nowhere near the size of
interview_compressed.pdf (21 KB).

Thanks,
Balaji


Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

> On 22.05.2015 at 23:00, Balaji Venkatamohan <bv...@tibco.com> wrote:
> 
> I opened the interview_compressed in notepad++ and did not see any
> 'Acroform' text anywhere.
> However, as Maruan suggested, I entered some data into what looks like form
> fields of interview_compressed.pdf and saved it. When I opened this file in
> notepad++, I did see 'Acroform' text in it. I also noticed an increase in
> file size from 21 KB to ~530 KB.
> 
> I then ran this filled saved compressed PDF in pdfdebugger.java and saw
> that the field values were getting stored but not under Acroform fields but
> under Annotations.



So the AcroForm's /Fields entry is an empty array?

> Please refer to this image:
> 
> http://imageshack.com/a/img540/9951/QGLDtS.jpg
> 
> So, whatever the compression technique was, it simply made all the Acroform
> fields disappear from the original PDF but retained all annotations which
> also contain the interactive forms and this helped reduce the file size so
> much? If this is the case, can pdfbox API also use similar compression
> technique to compress such a huge file into a smaller one?




Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
On 22.05.2015 at 23:00, Balaji Venkatamohan wrote:
> If this is the case, can pdfbox API also use similar compression
> technique to compress such a huge file into a smaller one?

Yes. Although we don't have it now, the code would be similar to the code in
WriteDecodedDoc. The relevant segment there currently is:

                 if (base instanceof COSStream)
                 {
                     // just kill the filters
                     COSStream cosStream = (COSStream)base;
                     cosStream.getUnfilteredStream();
                     cosStream.setFilters(null);
                 }

The new code would look like this (I didn't test it):


                 if (base instanceof COSStream)
                 {
                     COSStream cosStream = (COSStream)base;
                     // only add a filter to streams that have none yet
                     if (cosStream.getFilters() == null ||
                         (cosStream.getFilters() instanceof COSArray &&
                          ((COSArray) cosStream.getFilters()).size() == 0))
                     {
                         cosStream.setFilters(COSName.FLATE_DECODE);
                     }
                 }
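
For readers who want to see what the Flate filter does to the bytes of a
stream, independent of PDFBox, here is a minimal sketch that uses only
java.util.zip. The class and method names (FlateRoundTrip, deflate, inflate)
are made up for this illustration and are not part of PDFBox. It assumes the
stream uses plain FlateDecode without predictor parameters.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not a PDFBox class: zlib round trip on raw stream bytes.
final class FlateRoundTrip {

    /** Compresses raw bytes in zlib format, the format used by /FlateDecode. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses zlib data; rejects truncated or malformed input. */
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported flate data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Repetitive content-stream-like text compresses well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("BT /F1 12 Tf (Hello) Tj ET\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(raw);
        System.out.println("raw bytes: " + raw.length + ", deflated bytes: " + packed.length);
        System.out.println("round trip ok: " + java.util.Arrays.equals(raw, inflate(packed)));
    }
}
```

This only changes how stream bytes are stored. It does not add or remove
objects such as the AcroForm dictionary, so by itself it would not explain
form fields disappearing from the document catalog.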




Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.
I opened interview_compressed.pdf in Notepad++ and did not see the text
'Acroform' anywhere.
However, as Maruan suggested, I entered some data into what look like form
fields in interview_compressed.pdf and saved it. When I opened the saved file
in Notepad++, I did see the text 'Acroform' in it. I also noticed that the
file size had increased from 21 KB to ~530 KB.
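
The Notepad++ check described above can also be scripted. The following is a
minimal sketch using only the Java standard library; the class and method
names (RawPdfScan, indexOf, containsName) are made up for this illustration.
It searches the raw file bytes for the case-sensitive name /AcroForm. One
caveat, stated as an assumption about this file rather than a finding: if the
document catalog is stored inside a compressed object stream (/ObjStm), the
name will not be visible in the raw bytes even though it exists, so a negative
result is only a hint.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a PDFBox class: plain byte search in a PDF file.
final class RawPdfScan {

    /** Returns the offset of the first occurrence of token in data, or -1. */
    public static int indexOf(byte[] data, byte[] token) {
        if (token.length == 0) {
            return 0;
        }
        outer:
        for (int i = 0; i <= data.length - token.length; i++) {
            for (int j = 0; j < token.length; j++) {
                if (data[i + j] != token[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** True if the ASCII name (for example "/AcroForm") occurs in the raw bytes. */
    public static boolean containsName(byte[] pdfBytes, String name) {
        return indexOf(pdfBytes, name.getBytes(StandardCharsets.US_ASCII)) >= 0;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF to inspect
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("/AcroForm visible in raw bytes: " + containsName(bytes, "/AcroForm"));
        System.out.println("/ObjStm present: " + containsName(bytes, "/ObjStm"));
    }
}
```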

I then opened this filled-in, saved PDF in PDFDebugger and saw that the field
values were stored not under the AcroForm fields but under Annotations.
Please refer to this image:

http://imageshack.com/a/img540/9951/QGLDtS.jpg

So, whatever the compression technique was, did it simply remove all the
AcroForm fields from the original PDF while retaining the annotations, which
also carry the interactive form data, and is that what reduced the file size
so much? If so, could the PDFBox API use a similar technique to compress
such a huge file into a smaller one?




On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <sa...@fileaffairs.de>
wrote:

> Hi,
>
> > Am 22.05.2015 um 21:54 schrieb Tilman Hausherr <TH...@t-online.de>:
> >
> > Am 22.05.2015 um 17:53 schrieb Balaji Venkatamohan:
> >> Hello,
> >>
> >> I used PdfDebugger to make the internal PDF structure of the two files
> (1)
> >> interview.pdf and (2) interview_compressed.pdf  visually available and I
> >> have uploaded my images to imageshack. Here are the five links:
> >>
> >> http://imageshack.com/a/img538/8277/JghCpG.jpg
> >> http://imageshack.com/a/img909/6140/KsYNGR.jpg
> >> http://imageshack.com/a/img903/8644/mk15As.jpg
> >> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
> >> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
> >>
> >> The first two links are from the internal structure of interview.pdf
> >> (original uncompressed file)
> >> The third and fourth links are from the internal structure of
> >> interview_compressed.pdf (compressed file)
> >> The fifth link compares the file sizes of the two files and as you can
> also
> >> see, the difference is huge.
> >>
> >> As you might notice, the file interview_compressed.pdf has no acroform
> >
> > Indeed... but this is needed - from the spec:
> >
> > "The contents and properties of a document’s interactive form shall be
> defined by an interactive form dictionary that shall be referenced from the
> AcroForm entry in the document catalogue (see 7.7.2, “Document Catalog”).
> Table 218 shows the contents of this dictionary."
> >
>
> correct
>
> >> fields listed even though opening the PDF in pdf reader allows me to
> enter
> >> values in places which look like AcroForm fields and also save them. Are
> >> there any other PDF 'types' similar to Acroform fields which would
> enable
> >> users to fill data and which can be accessed in PdfBox APIs without
> having
> >> to go through PDAcrofield?
> >
> > Yes, annotations... there are some common parts, but this is just a
> vague observation from me, I'm not the acroform specialist.
>
> At first glance, it looks like all the entries necessary to (re-)generate
> the form fields are there. That's likely what is happening for this document
> in Adobe Reader. It would be interesting to see what is saved after the
> form has been filled out and saved using Acrobat. We'd need a test form to
> come up with an enhancement like this.
>
> BR
> Maruan
>
>
> >
> > What you should do: use NOTEPAD++ to look whether there's "/AcroForm" in
> the "compressed" file.
> > - if it is missing, tell the client (or your boss) just that
> > - if it isn't missing, then there's some problem in PDFBox (try also the
> loadNonSeq I mentioned earlier)
> >
> > Tilman
> >
> >>
> >> You can use qpdf , then use these options:
> >>
> >> I will now try using this link to compress the original file.
> >>
> >> Another strategy to think about - can your client generate a
> >> non-confidential file, so that you can share it, and the "compressed"
> file?
> >>
> >> I wish I had direct communication with the clients but due to
> bureaucracy,
> >> I am having to go through multiple layers to get my message across to
> them.
> >> I will share more information as soon as I have them.
> >>
> >> PS: i sent these image links to my personal email first to make sure
> that I
> >> can open them. I could and so I am hoping you all could too. If you are
> >> unable to open them, please let me know.
> >>
> >> Thanks,
> >> Balaji
> >>
> >>
> >> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <THausherr@t-online.de
> >
> >> wrote:
> >>
> >>> Am 22.05.2015 um 08:28 schrieb Andreas Lehmkühler:
> >>>
> >>>> Hi,
> >>>>
> >>>>  Balaji Venkatamohan <bv...@tibco.com> hat am 20. Mai 2015 um
> 03:24
> >>>>> geschrieben:
> >>>>>
> >>>>>
> >>>>> Thank you for your pointers and sorry about the image. I am
> attaching it
> >>>>> with this email.
> >>>>>
> >>>>> The point I am trying to make is that the PDF, which was decompressed
> >>>>> using
> >>>>> WriteDecodedDoc, is smaller in size than the original PDF given to
> us by
> >>>>> our customers.
> >>>>> Also, the decompressed PDF generated by WriterDecodedDoc of PDFBox
> did
> >>>>> not
> >>>>> have any PDAcroform fields whereas the decompressed PDF given to us
> by
> >>>>> the
> >>>>> customers does contain Acroform fields. Hence I wanted to know how to
> >>>>> properly decompress the PDF using pdfbox APIs. The reason why I was
> >>>>> analyzing COSStream was to check if the decompression of the
> compressed
> >>>>> PDF
> >>>>> was happening correctly while using PDFBox APIs.
> >>>>> I know it would have been difficult for you to help me without the
> actual
> >>>>> PDFs. For that, I would like to thank you for your time and pointers.
> >>>>>
> >>>> Maybe it's worth to try to share the file "visually" with us. Open
> both
> >>>> files
> >>>> (compressed and decompressed) with PDFDebugger [1] and post a
> screenshot
> >>>> of both
> >>>> somehwere (dropbox etc.) and share the link with us. Maybe that could
> >>>> shed some
> >>>> light on your issue.
> >>>>
> >>> @Balaji: here's an example on how such a screenshot would look like:
> >>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
> >>>
> >>> Tilman
> >>>
> >>>
> >>>
> >>>> BR
> >>>> Andreas Lehmkühler
> >>>>
> >>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
> >>>>
> >>>>  On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <
> THausherr@t-online.de>
> >>>>> wrote:
> >>>>>
> >>>>>  Hi,
> >>>>>> The image doesn't appear in the mailing list.
> >>>>>>
> >>>>>> This is all very confusing... /acroform is in the document catalog.
> I
> >>>>>> don't see how the page content stream is related to it. The best is
> that
> >>>>>> you either go through the source code, or read the spec and then
> look at
> >>>>>> the pdf.
> >>>>>>
> >>>>>> To find out what's going on, you'd have to start from that /acroform
> >>>>>> entry
> >>>>>> and then compare the two files.
> >>>>>>
> >>>>>> It is really difficult to help you without the files. The cause
> could
> >>>>>> be a
> >>>>>> bug in pdfbox, or a malformed pdf...
> >>>>>>
> >>>>>> Some more ideas:
> >>>>>> - use loadNonSeq(file, null) instead of load(file)
> >>>>>> - try the unreleased 2.0 version, that one has some improvements in
> the
> >>>>>> acroform stuff. Note that the API is different.
> >>>>>> https://pdfbox.apache.org/download.cgi#scm
> >>>>>> https://pdfbox.apache.org/2.0/getting-started.html
> >>>>>>
> >>>>>> If you still need help, one possibility would be 1) post the
> smallest
> >>>>>> possible code that fails, and 2) post a small part of the raw PDF,
> i.e.
> >>>>>> the
> >>>>>> objects relevant to the field in your code.
> >>>>>>
> >>>>>>
> >>>>>> Tilman
> >>>>>>
> >>>>>>
> >>>>>> Am 19.05.2015 um 23:03 schrieb Balaji Venkatamohan:
> >>>>>>
> >>>>>>  Moreover, for every page of the compressed PDF (there are 3
> pages), I
> >>>>>>> tried getting the COSStream for each of the page :
> >>>>>>>
> >>>>>>> PDPage firstPage=(PDPage)
> >>>>>>> document.getDocumentCatalog().getAllPages().get(0);
> >>>>>>>              pdStream=firstPage.getContents();
> >>>>>>>              COSStream stream=pdStream.getStream();
> >>>>>>>
> >>>>>>> In the above code snippet, the object stream, when analyzed in
> debug
> >>>>>>> mode, has the following:
> >>>>>>>
> >>>>>>>
> >>>>>>> The line from the compressed PDF as opened with Notepad++ is :
> >>>>>>>
> >>>>>>> <</Filter/FlateDecode/Length 5675>>stream
> >>>>>>>
> >>>>>>>  From this point on, using the COSStream object for every page, how
> >>>>>>> can I
> >>>>>>> decompress and find out the acroform fields given that the
> >>>>>>> unFilteredStream
> >>>>>>> object is null for COSStream?
> >>>>>>> ​
> >>>>>>>
> >>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <
> >>>>>>> bvenkata@tibco.com
> >>>>>>> <ma...@tibco.com>> wrote:
> >>>>>>>
> >>>>>>>      Thank you for your response Tilman.
> >>>>>>>
> >>>>>>>      I had previously tried using the WriteDecodedDoc for my
> compressed
> >>>>>>>      PDF and I tried to get the number of acro form fields present
> in
> >>>>>>>   the output file generated by WriteDecodedDoc. The API still could
> >>>>>>>      not find the acro form fields in the generated decompressed
> file.
> >>>>>>>       Also the decompressed file generated is 75 KB which is far
> less
> >>>>>>>      than the original decompressed file which I have (1.6 MB)
> though I
> >>>>>>>      could edit the acro form fields using acrobat reader.
> >>>>>>>
> >>>>>>>      Thanks,
> >>>>>>>      Balaji
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>      On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
> >>>>>>>      <THausherr@t-online.de <ma...@t-online.de>> wrote:
> >>>>>>>
> >>>>>>>          Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
> >>>>>>>
> >>>>>>>              My question is: how do I flatedecode a PDF so that I
> can
> >>>>>>>              find all the
> >>>>>>>              acroform fields within it. ANy help or pointers would
> be
> >>>>>>>              highly appreciated.
> >>>>>>>
> >>>>>>>
> >>>>>>>          You could try the WriteDecodedDoc option of the command
> line
> >>>>>>> app
> >>>>>>>
> https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
> >>>>>>>
> >>>>>>>          Maybe you can have further ideas by comparing the two
> files
> >>>>>>>          with NOTEPAD++.... however the two files might have their
> >>>>>>>          objects in different order.
> >>>>>>>
> >>>>>>>          Tilman
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

> Am 22.05.2015 um 21:54 schrieb Tilman Hausherr <TH...@t-online.de>:
> 
> Am 22.05.2015 um 17:53 schrieb Balaji Venkatamohan:
>> Hello,
>> 
>> I used PdfDebugger to make the internal PDF structure of the two files (1)
>> interview.pdf and (2) interview_compressed.pdf  visually available and I
>> have uploaded my images to imageshack. Here are the five links:
>> 
>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>> http://imageshack.com/a/img903/8644/mk15As.jpg
>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>> 
>> The first two links are from the internal structure of interview.pdf
>> (original uncompressed file)
>> The third and fourth links are from the internal structure of
>> interview_compressed.pdf (compressed file)
>> The fifth link compares the file sizes of the two files and as you can also
>> see, the difference is huge.
>> 
>> As you might notice, the file interview_compressed.pdf has no acroform
> 
> Indeed... but this is needed - from the spec:
> 
> "The contents and properties of a document’s interactive form shall be defined by an interactive form dictionary that shall be referenced from the AcroForm entry in the document catalogue (see 7.7.2, “Document Catalog”). Table 218 shows the contents of this dictionary."
> 

correct

>> fields listed even though opening the PDF in pdf reader allows me to enter
>> values in places which look like AcroForm fields and also save them. Are
>> there any other PDF 'types' similar to Acroform fields which would enable
>> users to fill data and which can be accessed in PdfBox APIs without having
>> to go through PDAcrofield?
> 
> Yes, annotations... there are some common parts, but this is just a vague observation from me, I'm not the acroform specialist.

At first glance, it looks like all the entries necessary to (re-)generate the form fields are there. That's likely what is happening for this document in Adobe Reader. It would be interesting to see what is saved after the form has been filled out and saved using Acrobat. We'd need a test form to come up with an enhancement like this.

BR
Maruan


> 
> What you should do: use NOTEPAD++ to look whether there's "/AcroForm" in the "compressed" file.
> - if it is missing, tell the client (or your boss) just that
> - if it isn't missing, then there's some problem in PDFBox (try also the loadNonSeq I mentioned earlier)
> 
> Tilman
> 
>> 
>> You can use qpdf , then use these options:
>> 
>> I will now try using this link to compress the original file.
>> 
>> Another strategy to think about - can your client generate a
>> non-confidential file, so that you can share it, and the "compressed" file?
>> 
>> I wish I had direct communication with the clients but due to bureaucracy,
>> I am having to go through multiple layers to get my message across to them.
>> I will share more information as soon as I have them.
>> 
>> PS: i sent these image links to my personal email first to make sure that I
>> can open them. I could and so I am hoping you all could too. If you are
>> unable to open them, please let me know.
>> 
>> Thanks,
>> Balaji
>> 
>> 
>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <TH...@t-online.de>
>> wrote:
>> 
>>> Am 22.05.2015 um 08:28 schrieb Andreas Lehmkühler:
>>> 
>>>> Hi,
>>>> 
>>>>  Balaji Venkatamohan <bv...@tibco.com> hat am 20. Mai 2015 um 03:24
>>>>> geschrieben:
>>>>> 
>>>>> 
>>>>> Thank you for your pointers and sorry about the image. I am attaching it
>>>>> with this email.
>>>>> 
>>>>> The point I am trying to make is that the PDF, which was decompressed
>>>>> using
>>>>> WriteDecodedDoc, is smaller in size than the original PDF given to us by
>>>>> our customers.
>>>>> Also, the decompressed PDF generated by WriterDecodedDoc of PDFBox did
>>>>> not
>>>>> have any PDAcroform fields whereas the decompressed PDF given to us by
>>>>> the
>>>>> customers does contain Acroform fields. Hence I wanted to know how to
>>>>> properly decompress the PDF using pdfbox APIs. The reason why I was
>>>>> analyzing COSStream was to check if the decompression of the compressed
>>>>> PDF
>>>>> was happening correctly while using PDFBox APIs.
>>>>> I know it would have been difficult for you to help me without the actual
>>>>> PDFs. For that, I would like to thank you for your time and pointers.
>>>>> 
>>>> Maybe it's worth to try to share the file "visually" with us. Open both
>>>> files
>>>> (compressed and decompressed) with PDFDebugger [1] and post a screenshot
>>>> of both
>>>> somehwere (dropbox etc.) and share the link with us. Maybe that could
>>>> shed some
>>>> light on your issue.
>>>> 
>>> @Balaji: here's an example on how such a screenshot would look like:
>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>> 
>>> Tilman
>>> 
>>> 
>>> 
>>>> BR
>>>> Andreas Lehmkühler
>>>> 
>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>> 
>>>>  On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <TH...@t-online.de>
>>>>> wrote:
>>>>> 
>>>>>  Hi,
>>>>>> The image doesn't appear in the mailing list.
>>>>>> 
>>>>>> This is all very confusing... /acroform is in the document catalog. I
>>>>>> don't see how the page content stream is related to it. The best is that
>>>>>> you either go through the source code, or read the spec and then look at
>>>>>> the pdf.
>>>>>> 
>>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>>> entry
>>>>>> and then compare the two files.
>>>>>> 
>>>>>> It is really difficult to help you without the files. The cause could
>>>>>> be a
>>>>>> bug in pdfbox, or a malformed pdf...
>>>>>> 
>>>>>> Some more ideas:
>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>> - try the unreleased 2.0 version, that one has some improvements in the
>>>>>> acroform stuff. Note that the API is different.
>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>> 
>>>>>> If you still need help, one possibility would be 1) post the smallest
>>>>>> possible code that fails, and 2) post a small part of the raw PDF, i.e.
>>>>>> the
>>>>>> objects relevant to the field in your code.
>>>>>> 
>>>>>> 
>>>>>> Tilman
>>>>>> 
>>>>>> 
>>>>>> Am 19.05.2015 um 23:03 schrieb Balaji Venkatamohan:
>>>>>> 
>>>>>>  Moreover, for every page of the compressed PDF (there are 3 pages), I
>>>>>>> tried getting the COSStream for each of the page :
>>>>>>> 
>>>>>>> PDPage firstPage=(PDPage)
>>>>>>> document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>              pdStream=firstPage.getContents();
>>>>>>>              COSStream stream=pdStream.getStream();
>>>>>>> 
>>>>>>> In the above code snippet, the object stream, when analyzed in debug
>>>>>>> mode, has the following:
>>>>>>> 
>>>>>>> 
>>>>>>> The line from the compressed PDF as opened with Notepad++ is :
>>>>>>> 
>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>> 
>>>>>>>  From this point on, using the COSStream object for every page, how
>>>>>>> can I
>>>>>>> decompress and find out the acroform fields given that the
>>>>>>> unFilteredStream
>>>>>>> object is null for COSStream?
>>>>>>> ​
>>>>>>> 
>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <
>>>>>>> bvenkata@tibco.com
>>>>>>> <ma...@tibco.com>> wrote:
>>>>>>> 
>>>>>>>      Thank you for your response Tilman.
>>>>>>> 
>>>>>>>      I had previously tried using the WriteDecodedDoc for my compressed
>>>>>>>      PDF and I tried to get the number of acro form fields present in
>>>>>>>   the output file generated by WriteDecodedDoc. The API still could
>>>>>>>      not find the acro form fields in the generated decompressed file.
>>>>>>>       Also the decompressed file generated is 75 KB which is far less
>>>>>>>      than the original decompressed file which I have (1.6 MB) though I
>>>>>>>      could edit the acro form fields using acrobat reader.
>>>>>>> 
>>>>>>>      Thanks,
>>>>>>>      Balaji
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>      On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>      <THausherr@t-online.de <ma...@t-online.de>> wrote:
>>>>>>> 
>>>>>>>          Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
>>>>>>> 
>>>>>>>              My question is: how do I flatedecode a PDF so that I can
>>>>>>>              find all the
>>>>>>>              acroform fields within it. ANy help or pointers would be
>>>>>>>              highly appreciated.
>>>>>>> 
>>>>>>> 
>>>>>>>          You could try the WriteDecodedDoc option of the command line
>>>>>>> app
>>>>>>>          https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>> 
>>>>>>>          Maybe you can have further ideas by comparing the two files
>>>>>>>          with NOTEPAD++.... however the two files might have their
>>>>>>>          objects in different order.
>>>>>>> 
>>>>>>>          Tilman
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 22.05.2015 um 17:53 schrieb Balaji Venkatamohan:
> Hello,
>
> I used PdfDebugger to make the internal PDF structure of the two files (1)
> interview.pdf and (2) interview_compressed.pdf  visually available and I
> have uploaded my images to imageshack. Here are the five links:
>
> http://imageshack.com/a/img538/8277/JghCpG.jpg
> http://imageshack.com/a/img909/6140/KsYNGR.jpg
> http://imageshack.com/a/img903/8644/mk15As.jpg
> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>
> The first two links are from the internal structure of interview.pdf
> (original uncompressed file)
> The third and fourth links are from the internal structure of
> interview_compressed.pdf (compressed file)
> The fifth link compares the file sizes of the two files and as you can also
> see, the difference is huge.
>
> As you might notice, the file interview_compressed.pdf has no acroform

Indeed... but this is needed - from the spec:

"The contents and properties of a document’s interactive form shall be 
defined by an interactive form dictionary that shall be referenced from 
the AcroForm entry in the document catalogue (see 7.7.2, “Document 
Catalog”). Table 218 shows the contents of this dictionary."

> fields listed even though opening the PDF in pdf reader allows me to enter
> values in places which look like AcroForm fields and also save them. Are
> there any other PDF 'types' similar to Acroform fields which would enable
> users to fill data and which can be accessed in PdfBox APIs without having
> to go through PDAcrofield?

Yes, annotations... there are some common parts, but this is just a 
vague observation from me, I'm not the acroform specialist.

What you should do: use Notepad++ to check whether there's "/AcroForm" in 
the "compressed" file.
- If it is missing, tell the client (or your boss) just that.
- If it isn't missing, then there's some problem in PDFBox (also try the 
loadNonSeq call I mentioned earlier).
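
The same check can be scripted instead of done by hand in an editor. The
following is a minimal sketch using only the JDK, not PDFBox; the class name
AcroFormTokenScan and its methods are made up for illustration. It looks for
the exact, case-sensitive byte sequence "/AcroForm" in the raw file. One
caveat: if the document catalog is stored inside a compressed object stream
(possible since PDF 1.5), the key will not be visible in the raw bytes, so a
negative result is only a hint and not proof.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of PDFBox.
final class AcroFormTokenScan {

    /** Returns the index of the first occurrence of token in data, or -1 if absent. */
    static int indexOf(byte[] data, byte[] token) {
        if (token.length == 0) {
            return 0;
        }
        outer:
        for (int i = 0; i <= data.length - token.length; i++) {
            for (int j = 0; j < token.length; j++) {
                if (data[i + j] != token[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** True if the raw bytes contain the case-sensitive name "/AcroForm". */
    static boolean containsAcroFormKey(byte[] pdfBytes) {
        byte[] token = "/AcroForm".getBytes(StandardCharsets.US_ASCII);
        return indexOf(pdfBytes, token) >= 0;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AcroFormTokenScan <file.pdf>");
            return;
        }
        byte[] pdfBytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(containsAcroFormKey(pdfBytes)
                ? "/AcroForm found in the raw bytes"
                : "/AcroForm not found in the raw bytes");
    }
}
```

Running it against both the original and the "compressed" file gives a quick
first comparison before looking at the structure in PDFDebugger.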

Tilman

>
> You can use qpdf , then use these options:
>
> I will now try using this link to compress the original file.
>
> Another strategy to think about - can your client generate a
> non-confidential file, so that you can share it, and the "compressed" file?
>
> I wish I had direct communication with the clients but due to bureaucracy,
> I am having to go through multiple layers to get my message across to them.
> I will share more information as soon as I have them.
>
> PS: i sent these image links to my personal email first to make sure that I
> can open them. I could and so I am hoping you all could too. If you are
> unable to open them, please let me know.
>
> Thanks,
> Balaji
>
>
> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <TH...@t-online.de>
> wrote:
>
>> Am 22.05.2015 um 08:28 schrieb Andreas Lehmkühler:
>>
>>> Hi,
>>>
>>>   Balaji Venkatamohan <bv...@tibco.com> hat am 20. Mai 2015 um 03:24
>>>> geschrieben:
>>>>
>>>>
>>>> Thank you for your pointers and sorry about the image. I am attaching it
>>>> with this email.
>>>>
>>>> The point I am trying to make is that the PDF, which was decompressed
>>>> using
>>>> WriteDecodedDoc, is smaller in size than the original PDF given to us by
>>>> our customers.
>>>> Also, the decompressed PDF generated by WriterDecodedDoc of PDFBox did
>>>> not
>>>> have any PDAcroform fields whereas the decompressed PDF given to us by
>>>> the
>>>> customers does contain Acroform fields. Hence I wanted to know how to
>>>> properly decompress the PDF using pdfbox APIs. The reason why I was
>>>> analyzing COSStream was to check if the decompression of the compressed
>>>> PDF
>>>> was happening correctly while using PDFBox APIs.
>>>> I know it would have been difficult for you to help me without the actual
>>>> PDFs. For that, I would like to thank you for your time and pointers.
>>>>
>>> Maybe it's worth to try to share the file "visually" with us. Open both
>>> files
>>> (compressed and decompressed) with PDFDebugger [1] and post a screenshot
>>> of both
>>> somehwere (dropbox etc.) and share the link with us. Maybe that could
>>> shed some
>>> light on your issue.
>>>
>> @Balaji: here's an example on how such a screenshot would look like:
>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>
>> Tilman
>>
>>
>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>
>>>   On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <TH...@t-online.de>
>>>> wrote:
>>>>
>>>>   Hi,
>>>>> The image doesn't appear in the mailing list.
>>>>>
>>>>> This is all very confusing... /acroform is in the document catalog. I
>>>>> don't see how the page content stream is related to it. The best is that
>>>>> you either go through the source code, or read the spec and then look at
>>>>> the pdf.
>>>>>
>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>> entry
>>>>> and then compare the two files.
>>>>>
>>>>> It is really difficult to help you without the files. The cause could
>>>>> be a
>>>>> bug in pdfbox, or a malformed pdf...
>>>>>
>>>>> Some more ideas:
>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>> - try the unreleased 2.0 version, that one has some improvements in the
>>>>> acroform stuff. Note that the API is different.
>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>
>>>>> If you still need help, one possibility would be 1) post the smallest
>>>>> possible code that fails, and 2) post a small part of the raw PDF, i.e.
>>>>> the
>>>>> objects relevant to the field in your code.
>>>>>
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>> Am 19.05.2015 um 23:03 schrieb Balaji Venkatamohan:
>>>>>
>>>>>   Moreover, for every page of the compressed PDF (there are 3 pages), I
>>>>>> tried getting the COSStream for each of the page :
>>>>>>
>>>>>> PDPage firstPage=(PDPage)
>>>>>> document.getDocumentCatalog().getAllPages().get(0);
>>>>>>               pdStream=firstPage.getContents();
>>>>>>               COSStream stream=pdStream.getStream();
>>>>>>
>>>>>> In the above code snippet, the object stream, when analyzed in debug
>>>>>> mode, has the following:
>>>>>>
>>>>>>
>>>>>> The line from the compressed PDF as opened with Notepad++ is :
>>>>>>
>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>
>>>>>>   From this point on, using the COSStream object for every page, how
>>>>>> can I
>>>>>> decompress and find out the acroform fields given that the
>>>>>> unFilteredStream
>>>>>> object is null for COSStream?
>>>>>> ​
>>>>>>
>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <
>>>>>> bvenkata@tibco.com
>>>>>> <ma...@tibco.com>> wrote:
>>>>>>
>>>>>>       Thank you for your response Tilman.
>>>>>>
>>>>>>       I had previously tried using the WriteDecodedDoc for my compressed
>>>>>>       PDF and I tried to get the number of acro form fields present in
>>>>>>    the output file generated by WriteDecodedDoc. The API still could
>>>>>>       not find the acro form fields in the generated decompressed file.
>>>>>>        Also the decompressed file generated is 75 KB which is far less
>>>>>>       than the original decompressed file which I have (1.6 MB) though I
>>>>>>       could edit the acro form fields using acrobat reader.
>>>>>>
>>>>>>       Thanks,
>>>>>>       Balaji
>>>>>>
>>>>>>
>>>>>>
>>>>>>       On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>       <THausherr@t-online.de <ma...@t-online.de>> wrote:
>>>>>>
>>>>>>           Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
>>>>>>
>>>>>>               My question is: how do I flatedecode a PDF so that I can
>>>>>>               find all the
>>>>>>               acroform fields within it. ANy help or pointers would be
>>>>>>               highly appreciated.
>>>>>>
>>>>>>
>>>>>>           You could try the WriteDecodedDoc option of the command line
>>>>>> app
>>>>>>           https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>
>>>>>>           Maybe you can have further ideas by comparing the two files
>>>>>>           with NOTEPAD++.... however the two files might have their
>>>>>>           objects in different order.
>>>>>>
>>>>>>           Tilman
>>>>>>
>>>>>>
>>>>>>
>>>>>>




Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.
Hello,

I used PDFDebugger to make the internal structure of the two files, (1)
interview.pdf and (2) interview_compressed.pdf, visible, and I
have uploaded the screenshots to imageshack. Here are the five links:

http://imageshack.com/a/img538/8277/JghCpG.jpg
http://imageshack.com/a/img909/6140/KsYNGR.jpg
http://imageshack.com/a/img903/8644/mk15As.jpg
http://imageshack.com/a/img901/8610/NXe3mJ.jpg
http://imageshack.com/a/img673/8633/0GMdjQ.jpg

The first two links are from the internal structure of interview.pdf
(original uncompressed file)
The third and fourth links are from the internal structure of
interview_compressed.pdf (compressed file)
The fifth link compares the file sizes of the two files and as you can also
see, the difference is huge.

As you might notice, interview_compressed.pdf has no AcroForm fields
listed, even though opening it in a PDF reader lets me enter values in
places which look like AcroForm fields and also save them. Are there any
other PDF 'types' similar to AcroForm fields which would let users fill in
data, and which can be accessed through the PDFBox API without having
to go through PDAcrofield?

> You can use qpdf, then use these options:

I will now try using this link to compress the original file.

> Another strategy to think about - can your client generate a
> non-confidential file, so that you can share it, and the "compressed" file?

I wish I had direct communication with the client, but due to bureaucracy
I have to go through multiple layers to get my message across to them.
I will share more information as soon as I have it.

PS: I sent these image links to my personal email first to make sure that I
can open them. I could, so I am hoping you all can too. If you are
unable to open them, please let me know.

Thanks,
Balaji


On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <TH...@t-online.de>
wrote:

> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>
>> Hi,
>>
>>  Balaji Venkatamohan <bv...@tibco.com> wrote on 20 May 2015 at 03:24:
>>>
>>>
>>> Thank you for your pointers and sorry about the image. I am attaching it
>>> with this email.
>>>
>>> The point I am trying to make is that the PDF, which was decompressed
>>> using
>>> WriteDecodedDoc, is smaller in size than the original PDF given to us by
>>> our customers.
>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>>> not
>>> have any PDAcroform fields whereas the decompressed PDF given to us by
>>> the
>>> customers does contain Acroform fields. Hence I wanted to know how to
>>> properly decompress the PDF using pdfbox APIs. The reason why I was
>>> analyzing COSStream was to check if the decompression of the compressed
>>> PDF
>>> was happening correctly while using PDFBox APIs.
>>> I know it would have been difficult for you to help me without the actual
>>> PDFs. For that, I would like to thank you for your time and pointers.
>>>
>> Maybe it's worth trying to share the file "visually" with us. Open both
>> files (compressed and decompressed) with PDFDebugger [1] and post a
>> screenshot of both somewhere (Dropbox etc.) and share the link with us.
>> Maybe that could shed some light on your issue.
>>
>
> @Balaji: here's an example of what such a screenshot would look like:
> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>
> Tilman
>
>
>
>> BR
>> Andreas Lehmkühler
>>
>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>
>>  On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <TH...@t-online.de>
>>> wrote:
>>>
>>>  Hi,
>>>>
>>>> The image doesn't appear in the mailing list.
>>>>
>>>> This is all very confusing... /acroform is in the document catalog. I
>>>> don't see how the page content stream is related to it. The best is that
>>>> you either go through the source code, or read the spec and then look at
>>>> the pdf.
>>>>
>>>> To find out what's going on, you'd have to start from that /acroform
>>>> entry
>>>> and then compare the two files.
>>>>
>>>> It is really difficult to help you without the files. The cause could
>>>> be a
>>>> bug in pdfbox, or a malformed pdf...
>>>>
>>>> Some more ideas:
>>>> - use loadNonSeq(file, null) instead of load(file)
>>>> - try the unreleased 2.0 version, that one has some improvements in the
>>>> acroform stuff. Note that the API is different.
>>>> https://pdfbox.apache.org/download.cgi#scm
>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>
>>>> If you still need help, one possibility would be 1) post the smallest
>>>> possible code that fails, and 2) post a small part of the raw PDF, i.e.
>>>> the
>>>> objects relevant to the field in your code.
>>>>
>>>>
>>>> Tilman
>>>>
>>>>
>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>
>>>>> Moreover, for every page of the compressed PDF (there are 3 pages), I
>>>>> tried getting the COSStream of the page:
>>>>>
>>>>> PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>>>> PDStream pdStream = firstPage.getContents();
>>>>> COSStream stream = pdStream.getStream();
>>>>>
>>>>> In the above code snippet, the object stream, when analyzed in debug
>>>>> mode, has the following:
>>>>>
>>>>>
>>>>> The line from the compressed PDF as opened with Notepad++ is :
>>>>>
>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>
>>>>> From this point on, using the COSStream object for every page, how
>>>>> can I decompress and find out the acroform fields given that the
>>>>> unFilteredStream object is null for COSStream?
>>>>>
>>>>>
>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>> <bvenkata@tibco.com> wrote:
>>>>>
>>>>>      Thank you for your response Tilman.
>>>>>
>>>>>      I had previously tried using the WriteDecodedDoc for my compressed
>>>>>      PDF and I tried to get the number of acro form fields present in
>>>>>      the output file generated by WriteDecodedDoc. The API still could
>>>>>      not find the acro form fields in the generated decompressed file.
>>>>>      Also the decompressed file generated is 75 KB which is far less
>>>>>      than the original decompressed file which I have (1.6 MB) though I
>>>>>      could edit the acro form fields using acrobat reader.
>>>>>
>>>>>      Thanks,
>>>>>      Balaji
>>>>>
>>>>>
>>>>>
>>>>>      On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>      <THausherr@t-online.de> wrote:
>>>>>
>>>>>          On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>
>>>>>              My question is: how do I flatedecode a PDF so that I can
>>>>>              find all the acroform fields within it. Any help or
>>>>>              pointers would be highly appreciated.
>>>>>
>>>>>
>>>>>          You could try the WriteDecodedDoc option of the command line
>>>>>          app
>>>>>          https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>
>>>>>          Maybe you can have further ideas by comparing the two files
>>>>>          with NOTEPAD++.... however the two files might have their
>>>>>          objects in different order.
>>>>>
>>>>>          Tilman
>>>>>
>>>>>
>>>>>
>>>>>

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
> Hi,
>
>> Balaji Venkatamohan <bv...@tibco.com> wrote on 20 May 2015 at 03:24:
>>
>>
>> Thank you for your pointers and sorry about the image. I am attaching it
>> with this email.
>>
>> The point I am trying to make is that the PDF, which was decompressed using
>> WriteDecodedDoc, is smaller in size than the original PDF given to us by
>> our customers.
>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did not
>> have any PDAcroform fields whereas the decompressed PDF given to us by the
>> customers does contain Acroform fields. Hence I wanted to know how to
>> properly decompress the PDF using pdfbox APIs. The reason why I was
>> analyzing COSStream was to check if the decompression of the compressed PDF
>> was happening correctly while using PDFBox APIs.
>> I know it would have been difficult for you to help me without the actual
>> PDFs. For that, I would like to thank you for your time and pointers.
> Maybe it's worth trying to share the file "visually" with us. Open both files
> (compressed and decompressed) with PDFDebugger [1] and post a screenshot of both
> somewhere (Dropbox etc.) and share the link with us. Maybe that could shed some
> light on your issue.

@Balaji: here's an example of what such a screenshot would look like:
http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png

Tilman

>
> BR
> Andreas Lehmkühler
>
> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>
>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>> Hi,
>>>
>>> The image doesn't appear in the mailing list.
>>>
>>> This is all very confusing... /acroform is in the document catalog. I
>>> don't see how the page content stream is related to it. The best is that
>>> you either go through the source code, or read the spec and then look at
>>> the pdf.
>>>
>>> To find out what's going on, you'd have to start from that /acroform entry
>>> and then compare the two files.
>>>
>>> It is really difficult to help you without the files. The cause could be a
>>> bug in pdfbox, or a malformed pdf...
>>>
>>> Some more ideas:
>>> - use loadNonSeq(file, null) instead of load(file)
>>> - try the unreleased 2.0 version, that one has some improvements in the
>>> acroform stuff. Note that the API is different.
>>> https://pdfbox.apache.org/download.cgi#scm
>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>
>>> If you still need help, one possibility would be 1) post the smallest
>>> possible code that fails, and 2) post a small part of the raw PDF, i.e. the
>>> objects relevant to the field in your code.
>>>
>>>
>>> Tilman
>>>
>>>
>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>
>>>> Moreover, for every page of the compressed PDF (there are 3 pages), I
>>>> tried getting the COSStream of the page:
>>>>
>>>> PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>>> PDStream pdStream = firstPage.getContents();
>>>> COSStream stream = pdStream.getStream();
>>>>
>>>> In the above code snippet, the object stream, when analyzed in debug
>>>> mode, has the following:
>>>>
>>>>
>>>> The line from the compressed PDF as opened with Notepad++ is :
>>>>
>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>
>>>>  From this point on, using the COSStream object for every page, how can I
>>>> decompress and find out the acroform fields given that the unFilteredStream
>>>> object is null for COSStream?
>>>>
>>>>
>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <bvenkata@tibco.com>
>>>> wrote:
>>>>
>>>>      Thank you for your response Tilman.
>>>>
>>>>      I had previously tried using the WriteDecodedDoc for my compressed
>>>>      PDF and I tried to get the number of acro form fields present in
>>>>      the output file generated by WriteDecodedDoc. The API still could
>>>>      not find the acro form fields in the generated decompressed file.
>>>>      Also the decompressed file generated is 75 KB which is far less
>>>>      than the original decompressed file which I have (1.6 MB) though I
>>>>      could edit the acro form fields using acrobat reader.
>>>>
>>>>      Thanks,
>>>>      Balaji
>>>>
>>>>
>>>>
>>>>      On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>      <THausherr@t-online.de> wrote:
>>>>
>>>>          On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>
>>>>              My question is: how do I flatedecode a PDF so that I can
>>>>              find all the acroform fields within it. Any help or
>>>>              pointers would be highly appreciated.
>>>>
>>>>
>>>>          You could try the WriteDecodedDoc option of the command line app
>>>>          https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>
>>>>          Maybe you can have further ideas by comparing the two files
>>>>          with NOTEPAD++.... however the two files might have their
>>>>          objects in different order.
>>>>
>>>>          Tilman
>>>>
>>>>
>>>>
>>>>


Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi,

> Balaji Venkatamohan <bv...@tibco.com> wrote on 20 May 2015 at 03:24:
> 
> 
> Thank you for your pointers and sorry about the image. I am attaching it
> with this email.
> 
> The point I am trying to make is that the PDF, which was decompressed using
> WriteDecodedDoc, is smaller in size than the original PDF given to us by
> our customers.
> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did not
> have any PDAcroform fields whereas the decompressed PDF given to us by the
> customers does contain Acroform fields. Hence I wanted to know how to
> properly decompress the PDF using pdfbox APIs. The reason why I was
> analyzing COSStream was to check if the decompression of the compressed PDF
> was happening correctly while using PDFBox APIs.
> I know it would have been difficult for you to help me without the actual
> PDFs. For that, I would like to thank you for your time and pointers.
Maybe it's worth trying to share the file "visually" with us. Open both files
(compressed and decompressed) with PDFDebugger [1] and post a screenshot of both
somewhere (Dropbox etc.) and share the link with us. Maybe that could shed some
light on your issue.

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger

> 
> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <TH...@t-online.de>
> wrote:
> 
> > Hi,
> >
> > The image doesn't appear in the mailing list.
> >
> > This is all very confusing... /acroform is in the document catalog. I
> > don't see how the page content stream is related to it. The best is that
> > you either go through the source code, or read the spec and then look at
> > the pdf.
> >
> > To find out what's going on, you'd have to start from that /acroform entry
> > and then compare the two files.
> >
> > It is really difficult to help you without the files. The cause could be a
> > bug in pdfbox, or a malformed pdf...
> >
> > Some more ideas:
> > - use loadNonSeq(file, null) instead of load(file)
> > - try the unreleased 2.0 version, that one has some improvements in the
> > acroform stuff. Note that the API is different.
> > https://pdfbox.apache.org/download.cgi#scm
> > https://pdfbox.apache.org/2.0/getting-started.html
> >
> > If you still need help, one possibility would be 1) post the smallest
> > possible code that fails, and 2) post a small part of the raw PDF, i.e. the
> > objects relevant to the field in your code.
> >
> >
> > Tilman
> >
> >
> > On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
> >
> >> Moreover, for every page of the compressed PDF (there are 3 pages), I
> >> tried getting the COSStream of the page:
> >>
> >> PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
> >> PDStream pdStream = firstPage.getContents();
> >> COSStream stream = pdStream.getStream();
> >>
> >> In the above code snippet, the object stream, when analyzed in debug
> >> mode, has the following:
> >>
> >>
> >> The line from the compressed PDF as opened with Notepad++ is :
> >>
> >> <</Filter/FlateDecode/Length 5675>>stream
> >>
> >> From this point on, using the COSStream object for every page, how can I
> >> decompress and find out the acroform fields given that the unFilteredStream
> >> object is null for COSStream?
> >>
> >>
> >> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <bvenkata@tibco.com>
> >> wrote:
> >>
> >>     Thank you for your response Tilman.
> >>
> >>     I had previously tried using the WriteDecodedDoc for my compressed
> >>     PDF and I tried to get the number of acro form fields present in
> >>     the output file generated by WriteDecodedDoc. The API still could
> >>     not find the acro form fields in the generated decompressed file.
> >>     Also the decompressed file generated is 75 KB which is far less
> >>     than the original decompressed file which I have (1.6 MB) though I
> >>     could edit the acro form fields using acrobat reader.
> >>
> >>     Thanks,
> >>     Balaji
> >>
> >>
> >>
> >>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
> >>     <THausherr@t-online.de> wrote:
> >>
> >>         On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
> >>
> >>             My question is: how do I flatedecode a PDF so that I can
> >>             find all the acroform fields within it. Any help or
> >>             pointers would be highly appreciated.
> >>
> >>
> >>         You could try the WriteDecodedDoc option of the command line app
> >>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
> >>
> >>         Maybe you can have further ideas by comparing the two files
> >>         with NOTEPAD++.... however the two files might have their
> >>         objects in different order.
> >>
> >>         Tilman
> >>
> >>
> >>
> >>


Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
On 22.05.2015 at 01:59, Balaji Venkatamohan wrote:
> I will have to use Adobe Acrobat Pro to compress the uncompressed PDF to
> see if this results in a resultant PDF which is close to 21 KB but I do not
> have the full Adobe software.

You can use qpdf, then use these options:
http://qpdf.sourceforge.net/

qpdf --stream-data=compress  source-file  dest-file
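
If qpdf has to be driven from Java code rather than from a shell, the call
above could be wrapped roughly as follows. This is only a sketch: it assumes
a qpdf binary is on the PATH, and the class and method names (QpdfRunner,
command) are made up for illustration. The only option used is the
--stream-data=compress flag shown above.

```java
import java.util.Arrays;
import java.util.List;

class QpdfRunner {

    // Builds the argument list for: qpdf --stream-data=compress source-file dest-file
    static List<String> command(String sourceFile, String destFile) {
        return Arrays.asList("qpdf", "--stream-data=compress", sourceFile, destFile);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: QpdfRunner source-file dest-file");
            return;
        }
        // Assumption: "qpdf" resolves to an installed qpdf executable.
        Process process = new ProcessBuilder(command(args[0], args[1]))
                .inheritIO()   // let qpdf print its own messages
                .start();
        System.out.println("qpdf exit code: " + process.waitFor());
    }
}
```

Comparing the size of dest-file with the 21 KB file from the customer would
then show whether plain stream compression alone explains the difference.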



Another strategy to think about - can your client generate a 
non-confidential file, so that you can share it, and the "compressed" file?

Tilman



Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.
Hi,

> How was the decompressing of the PDF from your customer done - did your
> customer also use PDFBox? Or something else?

I could not get the answer to this question from the customer; I am waiting
for a response from them. It's not the pdfbox API or the itextpdf API, and
it's not any of the online compression tools either. I verified this by
trying to compress the uncompressed file sent by the customer using the
pdfbox API, itextpdf and the online tools. The size of the resulting
compressed file for all three methods is close to 1.60 MB, but the
compressed file sent by the customer is only 21 KB!
I will have to use Adobe Acrobat Pro to compress the uncompressed PDF to
see if this results in a PDF which is close to 21 KB, but I do not
have the full Adobe software.

> And I read in the first post that the decompressed customer file was OK,
> but not the compressed file...  so the problem is to find if something is
> missing in the compressed file, or if PDFBox has a bug causing to miss it.

Both the decompressed (original) and the compressed file are okay when
opened with PDF reader software, that is, I am able to edit the acroform
fields manually and save them too. The problem is that when using the pdfbox
API, only the decompressed (original) file's acroform fields are read. When
I use the compressed file, pdfbox is not able to retrieve any of the
acroform fields and the API call PDDocumentCatalog.getAcroForm() returns
null.
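
One low-tech way to compare the two files without sharing them is to count
how often the relevant keys occur in the raw bytes of each file. The sketch
below uses only the JDK. The class name PdfKeyScan and the choice of keys are
assumptions made for illustration: /XFA and /ObjStm were not mentioned in
this thread and are included only as additional things worth ruling out. A
count of zero does not prove a key is absent, because since PDF 1.5 objects
may sit inside compressed object streams (/Type /ObjStm), where they are not
visible as plain text.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class PdfKeyScan {

    // Keys to look for; /XFA and /ObjStm are this sketch's own additions.
    static final String[] KEYS = {"/AcroForm", "/Fields", "/XFA", "/ObjStm"};

    // Counts plain-text occurrences of each key in the raw file bytes.
    static Map<String, Integer> countKeys(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is preserved.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : KEYS) {
            int count = 0;
            int from = text.indexOf(key);
            while (from >= 0) {
                count++;
                from = text.indexOf(key, from + key.length());
            }
            counts.put(key, count);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PdfKeyScan interview.pdf interview_compressed.pdf
        for (String name : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(name));
            System.out.println(name + " -> " + countKeys(bytes));
        }
    }
}
```

If the uncompressed file shows /AcroForm and the compressed one shows only
/ObjStm, that would suggest the catalog entries are stored in object streams,
and the counts themselves can be posted to the list without disclosing any
confidential content.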

Today, I used the iTextPDF API to read acroform fields from the compressed
PDF sent by the customer, and even that API could not locate the acroform
fields in the compressed PDF.
I will now try with the pdfbox 2.0.0 API. I will share with you the compression
technique used by the customer as soon as they get back to me.

Thank you,
Balaji


On Tue, May 19, 2015 at 10:52 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Hi,
>
> "How to properly decompress the PDF using pdfbox APIs" - see the source
> code of WriteDecodedDoc:
>
> https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/WriteDecodedDoc.java?view=markup&sortby=date
>
> How was the decompressing of the PDF from your customer done - did your
> customer also use PDFBox? Or something else?
>
> And I read in the first post that the decompressed customer file was OK,
> but not the compressed file...  so the problem is to find out whether
> something is missing in the compressed file, or whether PDFBox has a bug
> that causes it to miss it.
>
> Tilman
>
> PS: image didn't go through. Maybe upload it to imageshack.us.
>
>
> On 20.05.2015 at 03:24, Balaji Venkatamohan wrote:
>
>> Thank you for your pointers and sorry about the image. I am attaching it
>> with this email.
>>
>> The point I am trying to make is that the PDF, which was decompressed
>> using WriteDecodedDoc, is smaller in size than the original PDF given to us
>> by our customers.
>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>> not have any PDAcroform fields whereas the decompressed PDF given to us by
>> the customers does contain Acroform fields. Hence I wanted to know how to
>> properly decompress the PDF using pdfbox APIs. The reason why I was
>> analyzing COSStream was to check if the decompression of the compressed PDF
>> was happening correctly while using PDFBox APIs.
>> I know it would have been difficult for you to help me without the actual
>> PDFs. For that, I would like to thank you for your time and pointers.
>>
>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <THausherr@t-online.de>
>> wrote:
>>
>>     Hi,
>>
>>     The image doesn't appear in the mailing list.
>>
>>     This is all very confusing... /acroform is in the document
>>     catalog. I don't see how the page content stream is related to it.
>>     The best is that you either go through the source code, or read
>>     the spec and then look at the pdf.
>>
>>     To find out what's going on, you'd have to start from that
>>     /acroform entry and then compare the two files.
>>
>>     It is really difficult to help you without the files. The cause
>>     could be a bug in pdfbox, or a malformed pdf...
>>
>>     Some more ideas:
>>     - use loadNonSeq(file, null) instead of load(file)
>>     - try the unreleased 2.0 version, that one has some improvements
>>     in the acroform stuff. Note that the API is different.
>>     https://pdfbox.apache.org/download.cgi#scm
>>     https://pdfbox.apache.org/2.0/getting-started.html
>>
>>     If you still need help, one possibility would be 1) post the
>>     smallest possible code that fails, and 2) post a small part of the
>>     raw PDF, i.e. the objects relevant to the field in your code.
>>
>>
>>     Tilman
>>
>>
>>     On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>
>>         Moreover, for every page of the compressed PDF (there are 3
>>         pages), I tried getting the COSStream of the page:
>>
>>         PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>         PDStream pdStream = firstPage.getContents();
>>         COSStream stream = pdStream.getStream();
>>
>>         In the above code snippet, the object stream, when analyzed in
>>         debug mode, has the following:
>>
>>
>>         The line from the compressed PDF as opened with Notepad++ is :
>>
>>         <</Filter/FlateDecode/Length 5675>>stream
>>
>>         From this point on, using the COSStream object for every page,
>>         how can I decompress and find out the acroform fields given
>>         that the unFilteredStream object is null for COSStream?
>>
>>
>>         On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>         <bvenkata@tibco.com> wrote:
>>
>>             Thank you for your response Tilman.
>>
>>             I had previously tried using the WriteDecodedDoc for my
>>             compressed PDF and I tried to get the number of acro form
>>             fields present in the output file generated by WriteDecodedDoc.
>>             The API still could not find the acro form fields in the
>>             generated decompressed file.
>>             Also the decompressed file generated is 75 KB which is far
>>             less than the original decompressed file which I have (1.6 MB)
>>             though I could edit the acro form fields using acrobat reader.
>>
>>             Thanks,
>>             Balaji
>>
>>
>>
>>             On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>             <THausherr@t-online.de> wrote:
>>
>>                 On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>
>>                     My question is: how do I flatedecode a PDF so that I
>>                     can find all the acroform fields within it. Any help
>>                     or pointers would be highly appreciated.
>>
>>
>>                 You could try the WriteDecodedDoc option of the command
>>                 line app
>>                 https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>
>>                 Maybe you can have further ideas by comparing the two
>>                 files with NOTEPAD++.... however the two files might have
>>                 their objects in different order.
>>
>>                 Tilman
>>
>>
>>
>>
>>
>
>

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

"How to properly decompress the PDF using pdfbox APIs" - see the source 
code of WriteDecodedDoc:
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/WriteDecodedDoc.java?view=markup&sortby=date
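
To get a feel for what the Flate step itself does, independent of PDFBox,
the JDK's java.util.zip classes can inflate the bytes found between "stream"
and "endstream" of a /FlateDecode object. A minimal sketch follows. The class
name FlateSketch and the sample content are invented for illustration, and
the deflate helper exists only to produce input for the round trip. Keep in
mind that inflating a page content stream yields drawing operators, not form
fields.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateSketch {

    // Inflates zlib-wrapped data, which is what a /FlateDecode stream body holds.
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and no end marker: the input is truncated or unsupported.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Deflate helper, used here only to create sample input.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a tiny page content stream.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.US_ASCII));
    }
}
```

PDFBox performs this step internally when a stream is read, so code like
this is useful for checking a single stream by hand rather than as a
replacement for WriteDecodedDoc.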

How was the decompressing of the PDF from your customer done - did your 
customer also use PDFBox? Or something else?

And I read in the first post that the decompressed customer file was OK,
but not the compressed file...  so the problem is to find out whether
something is missing in the compressed file, or whether PDFBox has a bug
that causes it to miss it.

Tilman

PS: image didn't go through. Maybe upload it to imageshack.us.


On 20.05.2015 at 03:24, Balaji Venkatamohan wrote:
> Thank you for your pointers and sorry about the image. I am attaching 
> it with this email.
>
> The point I am trying to make is that the PDF, which was decompressed 
> using WriteDecodedDoc, is smaller in size than the original PDF given 
> to us by our customers.
> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
> not have any PDAcroform fields whereas the decompressed PDF given to 
> us by the customers does contain Acroform fields. Hence I wanted to 
> know how to properly decompress the PDF using pdfbox APIs. The reason 
> why I was analyzing COSStream was to check if the decompression of the 
> compressed PDF was happening correctly while using PDFBox APIs.
> I know it would have been difficult for you to help me without the 
> actual PDFs. For that, I would like to thank you for your time and 
> pointers.
>
> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr
> <THausherr@t-online.de> wrote:
>
>     Hi,
>
>     The image doesn't appear in the mailing list.
>
>     This is all very confusing... /acroform is in the document
>     catalog. I don't see how the page content stream is related to it.
>     The best is that you either go through the source code, or read
>     the spec and then look at the pdf.
>
>     To find out what's going on, you'd have to start from that
>     /acroform entry and then compare the two files.
>
>     It is really difficult to help you without the files. The cause
>     could be a bug in pdfbox, or a malformed pdf...
>
>     Some more ideas:
>     - use loadNonSeq(file, null) instead of load(file)
>     - try the unreleased 2.0 version, that one has some improvements
>     in the acroform stuff. Note that the API is different.
>     https://pdfbox.apache.org/download.cgi#scm
>     https://pdfbox.apache.org/2.0/getting-started.html
>
>     If you still need help, one possibility would be 1) post the
>     smallest possible code that fails, and 2) post a small part of the
>     raw PDF, i.e. the objects relevant to the field in your code.
>
>
>     Tilman
>
>
>     On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>
>         Moreover, for every page of the compressed PDF (there are 3
>         pages), I tried getting the COSStream of the page:
>
>         PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>         PDStream pdStream = firstPage.getContents();
>         COSStream stream = pdStream.getStream();
>
>         In the above code snippet, the object stream, when analyzed in
>         debug mode, has the following:
>
>
>         The line from the compressed PDF as opened with Notepad++ is :
>
>         <</Filter/FlateDecode/Length 5675>>stream
>
>         From this point on, using the COSStream object for every page,
>         how can I decompress and find out the acroform fields given
>         that the unFilteredStream object is null for COSStream?
>
>
>         On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>         <bvenkata@tibco.com> wrote:
>
>             Thank you for your response Tilman.
>
>             I had previously tried using the WriteDecodedDoc for my
>             compressed PDF and I tried to get the number of acro form
>             fields present in the output file generated by
>             WriteDecodedDoc. The API still could not find the acro form
>             fields in the generated decompressed file.
>             Also the decompressed file generated is 75 KB which is far
>             less than the original decompressed file which I have
>             (1.6 MB) though I could edit the acro form fields using
>             acrobat reader.
>
>             Thanks,
>             Balaji
>
>
>
>             On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>             <THausherr@t-online.de> wrote:
>
>                 Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
>
>                     My question is: how do I flatedecode a PDF so that
>                     I can find all the acroform fields within it. Any
>                     help or pointers would be highly appreciated.
>
>
>                 You could try the WriteDecodedDoc option of the
>                 command line app
>                 https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>
>                 Maybe you can have further ideas by comparing the two
>                 files with NOTEPAD++.... however the two files might
>                 have their objects in different order.
>
>                 Tilman
>
>
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org


Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.
Thank you for your pointers, and sorry about the image. I am attaching it
to this email.

The point I am trying to make is that the PDF decompressed with
WriteDecodedDoc is smaller than the original uncompressed PDF given to us by
our customers.
Also, the decompressed PDF generated by PDFBox's WriteDecodedDoc did not
have any AcroForm fields, whereas the uncompressed PDF given to us by the
customers does contain AcroForm fields. Hence I wanted to know how to
properly decompress the PDF using the PDFBox APIs. The reason I was
analyzing COSStream was to check whether the compressed PDF was being
decompressed correctly by the PDFBox APIs.
I know it is difficult for you to help me without the actual PDFs, so
thank you for your time and pointers.

On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Hi,
>
> The image doesn't appear in the mailing list.
>
> This is all very confusing... /acroform is in the document catalog. I
> don't see how the page content stream is related to it. The best is that
> you either go through the source code, or read the spec and then look at
> the pdf.
>
> To find out what's going on, you'd have to start from that /acroform entry
> and then compare the two files.
>
> It is really difficult to help you without the files. The cause could be a
> bug in pdfbox, or a malformed pdf...
>
> Some more ideas:
> - use loadNonSeq(file, null) instead of load(file)
> - try the unreleased 2.0 version, that one has some improvements in the
> acroform stuff. Note that the API is different.
> https://pdfbox.apache.org/download.cgi#scm
> https://pdfbox.apache.org/2.0/getting-started.html
>
> If you still need help, one possibility would be 1) post the smallest
> possible code that fails, and 2) post a small part of the raw PDF, i.e. the
> objects relevant to the field in your code.
>
>
> Tilman
>
>
> Am 19.05.2015 um 23:03 schrieb Balaji Venkatamohan:
>
>> Moreover, for every page of the compressed PDF (there are 3 pages), I
>> tried getting the COSStream for each of the page :
>>
>> PDPage firstPage=(PDPage)
>> document.getDocumentCatalog().getAllPages().get(0);
>>             pdStream=firstPage.getContents();
>>             COSStream stream=pdStream.getStream();
>>
>> In the above code snippet, the object stream, when analyzed in debug
>> mode, has the following:
>>
>>
>> The line from the compressed PDF as opened with Notepad++ is :
>>
>> <</Filter/FlateDecode/Length 5675>>stream
>>
>> From this point on, using the COSStream object for every page, how can I
>> decompress and find out the acroform fields given that the unFilteredStream
>> object is null for COSStream?
>>
>>
>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>> <bvenkata@tibco.com> wrote:
>>
>>     Thank you for your response Tilman.
>>
>>     I had previously tried using the WriteDecodedDoc for my compressed
>>     PDF and I tried to get the number of acro form fields present in
>>     the output file generated by WriteDecodedDoc. The API still could
>>     not find the acro form fields in the generated decompressed file.
>>     Also the decompressed file generated is 75 KB which is far less
>>     than the original decompressed file which I have (1.6 MB) though I
>>     could edit the acro form fields using acrobat reader.
>>
>>     Thanks,
>>     Balaji
>>
>>
>>
>>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>     <THausherr@t-online.de> wrote:
>>
>>         Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
>>
>>             My question is: how do I flatedecode a PDF so that I can
>>             find all the
>>             acroform fields within it. ANy help or pointers would be
>>             highly appreciated.
>>
>>
>>         You could try the WriteDecodedDoc option of the command line app
>>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>
>>         Maybe you can have further ideas by comparing the two files
>>         with NOTEPAD++.... however the two files might have their
>>         objects in different order.
>>
>>         Tilman
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

The image doesn't appear in the mailing list.

This is all very confusing... /AcroForm is an entry in the document
catalog. I don't see how the page content stream is related to it. The best
approach is to either go through the source code, or read the spec and then
look at the PDF.

To find out what's going on, you'd have to start from that /AcroForm
entry and then compare the two files.

It is really difficult to help you without the files. The cause could be
a bug in PDFBox, or a malformed PDF...

Some more ideas:
- use loadNonSeq(file, null) instead of load(file)
- try the unreleased 2.0 version; it has some improvements in the
AcroForm code. Note that the API is different.
https://pdfbox.apache.org/download.cgi#scm
https://pdfbox.apache.org/2.0/getting-started.html

If you still need help, one possibility would be 1) post the smallest 
possible code that fails, and 2) post a small part of the raw PDF, i.e. 
the objects relevant to the field in your code.
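
As a rough first check along the lines suggested above, the two files can be
scanned for the relevant PDF names before loading them with PDFBox. The sketch
below is an illustration only and is not PDFBox code: the class PdfNameScan and
its count method are made-up names, and only the Java standard library is used.
It counts plain-text occurrences of /AcroForm, /Fields and /ObjStm in the raw
bytes. If the compressed file shows /ObjStm but no /AcroForm, the catalog or
the form dictionary may be stored inside a compressed object stream (a PDF 1.5
feature), which would explain why a text editor does not show it. That is a
guess to be verified, not a diagnosis.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
final class PdfNameScan {

    /** Counts non-overlapping plain substring matches of a PDF name in the raw bytes. */
    static int count(byte[] pdf, String name) {
        if (name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        // ISO-8859-1 maps every byte to exactly one char, so binary data is not lost.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int hits = 0;
        int from = text.indexOf(name);
        while (from >= 0) {
            hits++;
            from = text.indexOf(name, from + name.length());
        }
        return hits;
    }

    // Usage: java PdfNameScan uncompressed.pdf compressed.pdf
    public static void main(String[] args) throws IOException {
        String[] names = {"/AcroForm", "/Fields", "/ObjStm"};
        for (String path : args) {
            byte[] pdf = Files.readAllBytes(Paths.get(path));
            for (String name : names) {
                System.out.println(path + " " + name + ": " + count(pdf, name));
            }
        }
    }
}
```

Names that live inside compressed streams are invisible to this scan, so a
count of zero only says the name is not present in plain text.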


Tilman


Am 19.05.2015 um 23:03 schrieb Balaji Venkatamohan:
> Moreover, for every page of the compressed PDF (there are 3 pages), I 
> tried getting the COSStream for each of the page :
>
> PDPage firstPage=(PDPage) 
> document.getDocumentCatalog().getAllPages().get(0);
>             pdStream=firstPage.getContents();
>             COSStream stream=pdStream.getStream();
>
> In the above code snippet, the object stream, when analyzed in debug 
> mode, has the following:
>
>
> The line from the compressed PDF as opened with Notepad++ is :
>
> <</Filter/FlateDecode/Length 5675>>stream
>
> From this point on, using the COSStream object for every page, how can 
> I decompress and find out the acroform fields given that the 
> unFilteredStream object is null for COSStream?
>
>
> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
> <bvenkata@tibco.com> wrote:
>
>     Thank you for your response Tilman.
>
>     I had previously tried using the WriteDecodedDoc for my compressed
>     PDF and I tried to get the number of acro form fields present in 
>     the output file generated by WriteDecodedDoc. The API still could
>     not find the acro form fields in the generated decompressed file.
>      Also the decompressed file generated is 75 KB which is far less
>     than the original decompressed file which I have (1.6 MB) though I
>     could edit the acro form fields using acrobat reader.
>
>     Thanks,
>     Balaji
>
>
>
>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>     <THausherr@t-online.de> wrote:
>
>         Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
>
>             My question is: how do I flatedecode a PDF so that I can
>             find all the
>             acroform fields within it. ANy help or pointers would be
>             highly appreciated.
>
>
>         You could try the WriteDecodedDoc option of the command line app
>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>
>         Maybe you can have further ideas by comparing the two files
>         with NOTEPAD++.... however the two files might have their
>         objects in different order.
>
>         Tilman
>
>
>
>
>
>


Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.
Moreover, for every page of the compressed PDF (there are 3 pages), I tried
getting the COSStream for each page:

            // PDFBox 1.8: take the first page and get its content stream
            PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
            PDStream pdStream = firstPage.getContents();
            COSStream stream = pdStream.getStream();

In the above code snippet, the stream object, when inspected in debug mode,
shows the following:

[screenshot not included in the list archive]

The line from the compressed PDF as opened with Notepad++ is:

         <</Filter/FlateDecode/Length 5675>>stream

From this point on, using the COSStream object for every page, how can I
decompress and find out the acroform fields given that the unFilteredStream
object is null for COSStream?
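
For reference, FlateDecode is zlib/deflate compression, so the bytes between
"stream" and "endstream" can be inflated with the Java standard library alone.
The sketch below round-trips a made-up content stream. FlateSketch, inflate
and deflate are hypothetical names, no PDFBox is involved, and it assumes the
stream has no /DecodeParms predictor. As pointed out elsewhere in this thread,
page content streams hold drawing operators, while the form fields hang off the
/AcroForm entry in the document catalog, so decoding page streams is not
expected to reveal them. In PDFBox 1.8 the corresponding call should be
COSStream.getUnfilteredStream(), which as far as I know decodes on demand, so a
null unFilteredStream field in the debugger may simply mean that nothing has
been decoded yet. Please check that against the 1.8 Javadoc.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of PDFBox.
final class FlateSketch {

    /** Inflates zlib-wrapped deflate data, as used by the FlateDecode filter. */
    static byte[] inflate(byte[] filtered) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(filtered));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses data the same way, only so the example has something to inflate. */
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(out)) {
            z.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up content stream: text operators, no form field data.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.ISO_8859_1));
    }
}
```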

On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan <bv...@tibco.com>
wrote:

> Thank you for your response Tilman.
>
> I had previously tried using the WriteDecodedDoc for my compressed PDF and
> I tried to get the number of acro form fields present in  the output file
> generated by WriteDecodedDoc. The API still could not find the acro form
> fields in the generated decompressed file.
>  Also the decompressed file generated is 75 KB which is far less than the
> original decompressed file which I have (1.6 MB) though I could edit the
> acro form fields using acrobat reader.
>
> Thanks,
> Balaji
>
>
>
> On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr <TH...@t-online.de>
> wrote:
>
>> Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
>>
>>> My question is: how do I flatedecode a PDF so that I can find all the
>>> acroform fields within it. ANy help or pointers would be highly
>>> appreciated.
>>>
>>
>> You could try the WriteDecodedDoc option of the command line app
>> https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>
>> Maybe you can have further ideas by comparing the two files with
>> NOTEPAD++.... however the two files might have their objects in different
>> order.
>>
>> Tilman
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>
>

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Balaji Venkatamohan <bv...@tibco.com>.
Thank you for your response, Tilman.

I had previously tried WriteDecodedDoc on my compressed PDF and then tried
to count the AcroForm fields present in the output file it generated. The
API still could not find any AcroForm fields in the generated decompressed
file.
Also, the decompressed file generated is 75 KB, which is far smaller than
the original uncompressed file that I have (1.6 MB), though I could edit the
AcroForm fields using Acrobat Reader.

Thanks,
Balaji



On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
>
>> My question is: how do I flatedecode a PDF so that I can find all the
>> acroform fields within it. ANy help or pointers would be highly
>> appreciated.
>>
>
> You could try the WriteDecodedDoc option of the command line app
> https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>
> Maybe you can have further ideas by comparing the two files with
> NOTEPAD++.... however the two files might have their objects in different
> order.
>
> Tilman
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>

Re: How to flatedecode and find all acroform fields in a compressed PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 19.05.2015 um 21:35 schrieb Balaji Venkatamohan:
> My question is: how do I flatedecode a PDF so that I can find all the
> acroform fields within it. Any help or pointers would be highly appreciated.

You could try the WriteDecodedDoc option of the command line app
https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc

Maybe you can get further ideas by comparing the two files with
Notepad++. However, the two files might have their objects in a
different order.
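
Since the objects may appear in a different order, it can help to compare the
files by object number rather than line by line. A minimal sketch, assuming
both files use classic "N G obj" headers in plain text: ObjectIndex and
objectNumbers are hypothetical names and only the Java standard library is
used. Objects packed inside compressed object streams will not be found this
way, binary stream data can produce accidental matches, and the comparison
only helps if the compressed version kept the same object numbers, which is
not guaranteed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class ObjectIndex {

    // Matches headers such as "2 0 obj": object number, generation, keyword.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(?<!\\d)(\\d{1,9})\\s+(\\d{1,5})\\s+obj\\b");

    /** Returns the sorted object numbers whose headers appear in plain text. */
    static SortedSet<Integer> objectNumbers(byte[] pdf) {
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        SortedSet<Integer> numbers = new TreeSet<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group(1)));
        }
        return numbers;
    }

    // Usage: java ObjectIndex uncompressed.pdf compressed.pdf
    public static void main(String[] args) throws IOException {
        SortedSet<Integer> a = objectNumbers(Files.readAllBytes(Paths.get(args[0])));
        SortedSet<Integer> b = objectNumbers(Files.readAllBytes(Paths.get(args[1])));
        SortedSet<Integer> onlyInA = new TreeSet<>(a);
        onlyInA.removeAll(b);
        SortedSet<Integer> onlyInB = new TreeSet<>(b);
        onlyInB.removeAll(a);
        System.out.println("only in " + args[0] + ": " + onlyInA);
        System.out.println("only in " + args[1] + ": " + onlyInB);
    }
}
```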

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org