Posted to users@pdfbox.apache.org by Evan Williams <ev...@zapprx.com> on 2017/03/03 13:31:30 UTC

Need Help With A Problematic PDF

I work with forms that come from drug manufacturers and pharmacies. Using
PDFBox (2.0.4), I fill them out and then display them in the browser or fax
them.

This has been working perfectly (thank you PDFBox team!), but I have a
mystery PDF which is causing trouble.

The first sign of trouble was that various PDF renderers in browsers
intermittently displayed filled-in fields as blank. Chrome's PDF viewer
was one such culprit. NeedAppearances was set to true in this form, so I
set it to false and refreshed the appearances, which may or may not have
fixed it (if it hasn't, I will do a full flatten on it).
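
For readers following along, the NeedAppearances change described above
might look roughly like this with the PDFBox 2.0.x form API. This is a
minimal sketch, not the code actually used here; the class name
AppearanceRefresher and the method refresh are hypothetical.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;

class AppearanceRefresher {
    /**
     * Turns off NeedAppearances and regenerates the field appearance streams,
     * optionally flattening the form afterwards.
     * Returns false when the document has no AcroForm at all.
     */
    static boolean refresh(PDDocument doc, boolean flatten) throws IOException {
        PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
        if (form == null) {
            return false;
        }
        // Viewers should use the stored appearances instead of rebuilding them.
        form.setNeedAppearances(false);
        // Rebuild the appearance streams from the current field values.
        form.refreshAppearances();
        if (flatten) {
            // Fallback: bake the field appearances into the page content.
            form.flatten();
        }
        return true;
    }
}
```

Calling refresh(doc, false) after setting the field values and before
saving is the non-destructive variant; passing true is the "full flatten"
fallback, which removes the interactive fields.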

More seriously, the faxing service that we use chokes on this form. They
are unable to render it even if I don't fill it out and send them the
original PDF. They are investigating this on their end, but since this is
the only PDF that has ever had this issue and because of its previous
suspicious behavior, I believe that it might be corrupt in some subtle way
that I don't understand. PDFBox seems to have no trouble working with it,
and it is viewable and printable in Acrobat and in third-party viewers, but
I am uncertain.

I would very much appreciate any tips you can give me on examining this PDF
and debugging the issue.

The file is here:
https://dl.dropboxusercontent.com/u/25802656/Tyvaso-Revised.pdf

The PDFBox community is awesome, and I am very grateful for your time.
Thank you
-- 
*Evan Williams*
Sr. Software Engineer
evan.williams@zapprx.com

*www.ZappRx.com <http://www.zapprx.com/>*

Re: Need Help With A Problematic PDF

Posted by Evan Williams <ev...@zapprx.com>.
Hi Tilman,

Unfortunately, I am most definitely not the creator of the PDF. I get the
forms from the drug manufacturers and pharmacies that produce them. And,
unfortunately, I am legally constrained, for at least some of these forms,
to use their exact PDF (or an exact recreation of it). The quality of
these forms is extremely variable.

My job is to make the best possible tools to take prescriptions held in a
predictable, regular electronic record and use them to fill out these
specific forms, of which I have dozens, trending to hundreds.

That is the 'interesting' aspect of my job.

So, unfortunately, improving the PDF by recreating it is not a viable
option. And even if it were, I would have to repeat that an unbounded
number of times, once for every 'bad' PDF in my library.

I am trying some things and I will tell you what I come up with.

Thank you so much for investing so much time in my problem. I truly
appreciate it.

On Mon, Mar 6, 2017 at 12:30 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Thanks for the news.... I had another look at your PDF. It is not very
> efficient... for example, the same colorspace is used several times; the
> same font (but different subsets) is in each page. Instead of only one
> entry for each font. The company logos are not in an XObject form, so they
> are repeated for each page. If you're the creator of the file, maybe you
> can do something there...
>
> Tilman



Re: Need Help With A Problematic PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 06.03.2017 um 18:09 schrieb Evan Williams:
> Hi Tilman!
>
> So, it develops that the fax service has an upper limit of 4 Mb per
> individual uploaded file. So I will probably try to implement your PDFSplit
> solution (I tried compressing the file but was unable to get it under 4 Mb).
>
> I am concerned about memory and processing time for the PDFSplit but the
> fax service seems unwilling to contemplate fixing this on their end (their
> actual render works fine with the form, it is just their API layer that
> imposes the limit).

Thanks for the news... I had another look at your PDF. It is not very
efficient. For example, the same colorspace is used several times, and the
same font (in different subsets) is embedded in each page instead of there
being only one entry for each font. The company logos are not in a form
XObject, so they are repeated for each page. If you're the creator of the
file, maybe you can do something there...
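
One way to check the font observation above on other forms is to count
how many distinct font dictionaries share the same base font name once the
subset tag is stripped. The following is only a rough diagnostic sketch
against the PDFBox 2.0.x API; the class FontSubsetReport and its methods
are hypothetical names, and it looks at page-level resources only (fonts
used inside form XObjects or annotations are not inspected).

```java
import java.io.IOException;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

class FontSubsetReport {
    /** Strips a subset tag such as "ABCDEF+" from a font name, if present. */
    static String baseName(String fontName) {
        if (fontName != null && fontName.matches("[A-Z]{6}\\+.+")) {
            return fontName.substring(7);
        }
        return fontName;
    }

    /** Maps each base font name to the number of distinct font dictionaries using it. */
    static Map<String, Integer> distinctFontObjects(PDDocument doc) throws IOException {
        Map<String, Set<COSDictionary>> byName = new TreeMap<>();
        for (PDPage page : doc.getPages()) {
            PDResources resources = page.getResources();
            if (resources == null) {
                continue;
            }
            for (COSName key : resources.getFontNames()) {
                PDFont font = resources.getFont(key);
                String name = (font == null) ? null : baseName(font.getName());
                if (name == null) {
                    continue;
                }
                // Identity set: a dictionary shared between pages counts once.
                Set<COSDictionary> dicts = byName.get(name);
                if (dicts == null) {
                    dicts = Collections.newSetFromMap(new IdentityHashMap<COSDictionary, Boolean>());
                    byName.put(name, dicts);
                }
                dicts.add(font.getCOSObject());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<COSDictionary>> e : byName.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }
}
```

A count well above 1 for one base name suggests the font is stored again
and again rather than shared, which is the pattern described above.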

Tilman





---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Need Help With A Problematic PDF

Posted by Evan Williams <ev...@zapprx.com>.
Hi Tilman!

So, it turns out that the fax service has an upper limit of 4 MB per
individual uploaded file, so I will probably try to implement your PDFSplit
solution (I tried compressing the file but was unable to get it under 4 MB).

I am concerned about memory and processing time for the PDFSplit but the
fax service seems unwilling to contemplate fixing this on their end (their
actual render works fine with the form, it is just their API layer that
imposes the limit).
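
A small guard like the following could confirm that each piece fits under
the upload limit before it is sent. It is a sketch using only the Java
standard library; the class UploadSizeCheck is a hypothetical name, and
the reading of "4 MB" as 4 * 1024 * 1024 bytes is an assumption, since the
service may count the limit differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class UploadSizeCheck {
    // Assumption: the limit is 4 MiB; adjust if the service counts 4,000,000 bytes.
    static final long MAX_UPLOAD_BYTES = 4L * 1024 * 1024;

    /** Returns the files whose size on disk exceeds limitBytes, in input order. */
    static List<Path> oversized(List<Path> files, long limitBytes) throws IOException {
        List<Path> tooBig = new ArrayList<>();
        for (Path file : files) {
            if (Files.size(file) > limitBytes) {
                tooBig.add(file);
            }
        }
        return tooBig;
    }
}
```

An empty result from oversized(parts, UploadSizeCheck.MAX_UPLOAD_BYTES)
means every part can be uploaded on its own; any file it returns would
need further splitting or compression.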

I will tell you how it works out and follow up with any questions that I
might have.

Thank you so much for taking the time to respond with your excellent
suggestion.

On Fri, Mar 3, 2017 at 12:35 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> It would be useful to have a version of this PDF with entries and then see
> what happens with Chrome. And then pass that one there.
>
> Re that fax service, try PDFSplit to split it in single pages, and then
> try with every single page. That should narrow the problem.
>
> PDF is a complex format... there are many (including us) that don't
> implement all. Normally this shouldn't lead to a crash.
>
> Tilman



Re: Need Help With A Problematic PDF

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 03.03.2017 um 14:31 schrieb Evan Williams:
> The first sign of trouble is, various PDF renderers in browsers were
> displaying fields that had been filled in as blank (intermittently).
> Chrome's PDF viewer was one such culprit. Need Appearances was set to true
> with this form, so I set it to False and refreshed the appearances, which
> may or may not have fixed it (if it doesn't I will do a full flatten on it).
>
> More seriously, the faxing service that we use chokes on this form. They
> are unable to render it even if I don't fill it out and send them the
> original PDF. They are investigating this on their end, but since this is
> the only PDF that has ever had this issue and because of its previous
> suspicious behavior, I believe that it might be corrupt in some subtle way
> that I don't understand. PDFBox seems to have no trouble working with it,
> and it is viewable and printable in Acrobat and in third party viewers. But
> I am uncertain.

It would be useful to have a version of this PDF with entries and then 
see what happens with Chrome. And then pass that one there.

Regarding the fax service, try PDFSplit to split the file into single
pages, and then try each page on its own. That should narrow down the problem.
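
PDFSplit is the command-line tool in pdfbox-app; from Java code the same
result can be had with the Splitter class. The following is a minimal
sketch assuming PDFBox 2.0.x (where PDDocument.load(File) exists); the
class SinglePageSplitter, the method splitToPages and the "page-N.pdf"
file naming are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

class SinglePageSplitter {
    /** Writes every page of source as its own PDF into outDir and returns the files. */
    static List<File> splitToPages(File source, File outDir) throws IOException {
        List<File> written = new ArrayList<>();
        try (PDDocument doc = PDDocument.load(source)) {
            // A Splitter with default settings produces one document per page.
            List<PDDocument> parts = new Splitter().split(doc);
            int pageNo = 1;
            for (PDDocument part : parts) {
                try {
                    File out = new File(outDir, "page-" + pageNo + ".pdf");
                    // Save while the source document is still open.
                    part.save(out);
                    written.add(out);
                } finally {
                    part.close();
                }
                pageNo++;
            }
        }
        return written;
    }
}
```

Sending the resulting files to the service one at a time shows whether a
particular page triggers the failure.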

PDF is a complex format... there are many implementations (including ours)
that don't implement all of it. Normally this shouldn't lead to a crash.

Tilman




Re: Need Help With A Problematic PDF

Posted by Evan Williams <ev...@zapprx.com>.
Karl,

Thank you so much. I will pursue that. And thank you for the tips on
investigating issues.

Unfortunately, even though the ultimate result of my labors is a PDF, the
actual PDF generation is not really the focus of my work (that is figuring
out exactly what to put in the form fields). So I just do not have the time
to understand the PDF format as well as I should. I suspect many PDFBox
users have similar issues.

I really appreciate your help, and the wonderful PDFBox community. Thank
you again.

On Fri, Mar 3, 2017 at 9:23 AM, Karl Heinz Kremer <kh...@khk.net> wrote:

> This is a corrupt PDF file, but I doubt that the syntax problems are
> causing the problems with your fax service:
>
> If you have access to Adobe Acrobat Pro, you can run the "Report PDF syntax
> issues" preflight profile to get a list of syntax problems. There are two
> types of problems in your file: One is related to the tags structure, and
> you can test if that is causing any problems by just removing the tags
> tree. The second type of problem is related to the /Tabs entry in the page
> dictionary. This optional key can have the following values associated with
> it (from the PDF spec):
>
> *"(Optional; PDF 1.5) A name specifying the tab order that shall be used
> for annotations on the page. The possible values shall be R (row order), C
> (column order), and S (structure order). See 12.5, "Annotations" for
> details."*
> The document does however use a "W".
>
> You can change this by processing your document with the pdfbox-app, using
> the "WriteDecodedDoc" command to uncompress it, and then change the /W
> instances to e.g. /R or /C (this will change the way you tab through
> the document when filling out the form in Acrobat or Reader). Keep in
> mind that when editing a PDF file, you need to use a binary editor, and you
> have to make sure that the cross reference table is still valid after your
> changes.




Re: Need Help With A Problematic PDF

Posted by Karl Heinz Kremer <kh...@khk.net>.
This is a corrupt PDF file, but I doubt that the syntax problems are
causing the problems with your fax service:

If you have access to Adobe Acrobat Pro, you can run the "Report PDF syntax
issues" preflight profile to get a list of syntax problems. There are two
types of problems in your file: One is related to the tags structure, and
you can test if that is causing any problems by just removing the tags
tree. The second type of problem is related to the /Tabs entry in the page
dictionary. This optional key can have the following values associated with
it (from the PDF spec):

*"(Optional; PDF 1.5) A name specifying the tab order that shall be used
for annotations on the page. The possible values shall be R (row order), C
(column order), and S (structure order). See 12.5, "Annotations" for
details."*
The document does however use a "W".

You can change this by processing your document with the pdfbox-app, using
the "WriteDecodedDoc" command to uncompress it, and then change the /W
instances to e.g. /R or /C (this will change the way you tab through
the document when filling out the form in Acrobat or Reader). Keep in
mind that when editing a PDF file, you need to use a binary editor, and you
have to make sure that the cross reference table is still valid after your
changes.
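
As an alternative to hand-editing the decoded file, the same two changes
could be made through the PDFBox object model, which rewrites the cross
reference table on save. Note that this is a different technique from the
binary-editor route described above. It is a sketch assuming PDFBox 2.0.x;
the class TabOrderFixer and its methods are hypothetical names, and the
set of accepted values is taken from the spec excerpt quoted above.

```java
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

class TabOrderFixer {
    static final COSName TABS = COSName.getPDFName("Tabs");

    /**
     * Replaces every page-level /Tabs value other than R, C or S with the
     * given replacement (for example "R") and returns the number of pages changed.
     */
    static int fixTabs(PDDocument doc, String replacement) {
        int changed = 0;
        for (PDPage page : doc.getPages()) {
            COSDictionary dict = page.getCOSObject();
            String value = dict.getNameAsString(TABS);
            boolean valid = "R".equals(value) || "C".equals(value) || "S".equals(value);
            // /Tabs is optional, so a missing entry is left alone.
            if (value != null && !valid) {
                dict.setName(TABS, replacement);
                changed++;
            }
        }
        return changed;
    }

    /**
     * Drops the structure (tags) tree reference from the catalog, as a test aid.
     * Returns true if an entry was present.
     */
    static boolean removeTagsTree(PDDocument doc) {
        COSDictionary catalog = doc.getDocumentCatalog().getCOSObject();
        boolean present = catalog.containsKey(COSName.STRUCT_TREE_ROOT);
        catalog.removeItem(COSName.STRUCT_TREE_ROOT);
        return present;
    }
}
```

Both methods modify the document in memory only; it still has to be saved
to a new file, and removing the tags tree makes the copy untagged, so that
step is best kept to a test copy.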


Karl Heinz Kremer
PDF Acrobatics Without a Net
PDF Software Development, Training and More...

khk@khk.net
http://www.khkonsulting.com

