Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/08/04 22:02:13 UTC
[jira] [Comment Edited] (PDFBOX-1555) Javascript at the end of the PDF document fails parsing

    [ https://issues.apache.org/jira/browse/PDFBOX-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085197#comment-14085197 ] 

John Hewson edited comment on PDFBOX-1555 at 8/4/14 8:01 PM:
-------------------------------------------------------------

This file is, roughly speaking, valid from an Acrobat perspective. The [Adobe Supplement to ISO 32000|http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf], in _3.4.4 File Trailer_, says:

{quote}
Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
{quote}
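
As a rough illustration of that rule, the check could be sketched as below. This is only a sketch under my own assumptions: the class and method names (EofMarkerCheck, lastEofOffsetInTail) are hypothetical and are not part of PDFBox, and the sample bytes in main are made up to resemble the trailing script reported in this issue.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a PDFBox class: illustrates the Acrobat rule quoted above.
class EofMarkerCheck {
    // Window size taken from the quoted Adobe supplement text.
    static final int TAIL_WINDOW = 1024;
    static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the last "%%EOF" marker that starts within the
     * final 1024 bytes of the data, or -1 if there is none.
     */
    public static int lastEofOffsetInTail(byte[] pdf) {
        int windowStart = Math.max(0, pdf.length - TAIL_WINDOW);
        // Scan backwards so that the marker closest to the end is found first.
        for (int i = pdf.length - EOF_MARKER.length; i >= windowStart; i--) {
            if (matchesAt(pdf, i)) {
                return i;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[offset + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample: a marker followed by a short trailing script.
        byte[] sample = ("%PDF-1.4\n%%EOF\r\n\r\n<script type=\"text/javascript\">\r\n</script>\r\n")
                .getBytes(StandardCharsets.US_ASCII);
        int offset = lastEofOffsetInTail(sample);
        System.out.println(offset >= 0 ? "marker at offset " + offset : "no marker in tail");
    }
}
```

A lenient parser could treat everything after the returned offset plus the marker length as trailing junk. Whether PDFBox should do so is a separate question that this sketch does not settle.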


> Javascript at the end of the PDF document fails parsing
> -------------------------------------------------------
>
>                 Key: PDFBOX-1555
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1555
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.0
>            Reporter: Jinder Aujla
>         Attachments: 0001-MA-1981-Analyzer-Production-heitman.com-PDF-attachme.patch, 0002-MA-1981-Analyzer-Production-heitman.com-PDF-attachme.patch
>
>
> Hi
> I was investigating a parse failure and debugging the PDFBox code when I noticed that the PDF document (which I can't forward) has this at the end of the file:
> %%EOF^M
> ^M
> ^M
> <script type="text/javascript">^M
> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");^M
> document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));^M
> </script>^M
> <script type="text/javascript">^M
> try {^M
> var pageTracker = _gat._getTracker("UA-7429935-1");^M
> pageTracker._trackPageview();^M
> } catch(err) {}</script>^M
> ^M
> ^M
> So the document ends, but there is more content after it: some JavaScript. The parser gets to
> line 492 in org.apache.pdfbox.pdfparser.PDFParser, where
> isEndOfFile gets set to true, but because it's not the end of the actual stream, it continues (this was a fix in PDFBOX-979).
> Next time around the loop it reads
> <script type="text/javascript">
> which I think it ignores, then tries to read
> var
> twice as a number, and then blows up. I've been playing around, thinking of a sensible thing to do, but I'm worried that I might introduce some other issue. I assume this is a legal structure for a PDF document. It opens fine in a viewer.


