Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/08/04 22:02:13 UTC
[jira] [Comment Edited] (PDFBOX-1555) Javascript at the end of the
PDF document fails parsing
[ https://issues.apache.org/jira/browse/PDFBOX-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085197#comment-14085197 ]
John Hewson edited comment on PDFBOX-1555 at 8/4/14 8:01 PM:
-------------------------------------------------------------
This file is, roughly speaking, valid from an Acrobat perspective. The [Adobe Supplement to the ISO 32000|http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf] in _3.4.4 File Trailer_ says:
{quote}
Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
{quote}
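As a rough illustration of that rule, here is a minimal Java sketch that checks whether a %%EOF marker starts within the last 1024 bytes of a file's contents. This is not PDFBox code: the class and method names (EofMarkerCheck, lastEofOffset, hasEofInTail) are hypothetical, and reading "within the last 1024 bytes" as "the marker starts inside that window" is my assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: illustrates the Acrobat rule quoted above.
class EofMarkerCheck {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    // Window size taken from the quoted Adobe text.
    private static final int WINDOW = 1024;

    /** Returns the offset of the last %%EOF marker in data, or -1 if there is none. */
    static int lastEofOffset(byte[] data) {
        for (int i = data.length - EOF_MARKER.length; i >= 0; i--) {
            if (matchesAt(data, i)) {
                return i;
            }
        }
        return -1;
    }

    /** True if the bytes at offset equal the %%EOF marker. */
    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[offset + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }

    /** True if the last %%EOF marker starts within the final 1024 bytes (assumed reading of the rule). */
    static boolean hasEofInTail(byte[] data) {
        int offset = lastEofOffset(data);
        return offset >= 0 && offset >= data.length - WINDOW;
    }

    public static void main(String[] args) {
        // A marker followed by a short amount of trailing content, as in the reported file.
        byte[] sample = "%PDF-1.4\n%%EOF\r\n<script></script>\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasEofInTail(sample));
    }
}
```

A lenient parser could use the offset from lastEofOffset to treat everything after the marker as trailing garbage, although whether that is the right fix for PDFBOX-1555 is a separate question.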
> Javascript at the end of the PDF document fails parsing
> -------------------------------------------------------
>
> Key: PDFBOX-1555
> URL: https://issues.apache.org/jira/browse/PDFBOX-1555
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.0
> Reporter: Jinder Aujla
> Attachments: 0001-MA-1981-Analyzer-Production-heitman.com-PDF-attachme.patch, 0002-MA-1981-Analyzer-Production-heitman.com-PDF-attachme.patch
>
>
> Hi
> I was investigating a parsing failure and debugging the PDFBox code when I noticed that the PDF document (which I can't forward) has this at the end of the file:
> %%EOF^M
> ^M
> ^M
> <script type="text/javascript">^M
> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");^M
> document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));^M
> </script>^M
> <script type="text/javascript">^M
> try {^M
> var pageTracker = _gat._getTracker("UA-7429935-1");^M
> pageTracker._trackPageview();^M
> } catch(err) {}</script>^M
> ^M
> ^M
> So the document ends, but there is more content after it: basically some JavaScript. The parser gets to
> line 492 in org.apache.pdfbox.pdfparser.PDFParser, where
> isEndOfFile gets set to true, but because it's not the end of the actual stream, it continues. This was a fix in PDFBOX-979.
> Next time around in the loop it reads
> <script type="text/javascript">
> which I think it ignores, then tries to read
> var
> twice as a number. Then it blows up. I've been playing around, thinking of a sensible thing to do, but I'm worried that I might introduce some other issue. I assume this is a legal structure for a PDF document. It opens fine in a viewer.