Posted to dev@pdfbox.apache.org by "Maruan Sahyoun (JIRA)" <ji...@apache.org> on 2013/03/29 08:55:15 UTC
[jira] [Commented] (PDFBOX-1555) Javascript at the end of the PDF document fails parsing

    [ https://issues.apache.org/jira/browse/PDFBOX-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617169#comment-13617169 ] 

Maruan Sahyoun commented on PDFBOX-1555:
----------------------------------------

Hi,

we might be able to improve the parser to handle such a situation, but the PDF is not valid. A valid PDF ends with %%EOF, and further data is not permitted (see section 7.5.5 of the ISO 32000 specification). I think it would be much better to identify the root cause of the issue, namely why the JavaScript (for Google Analytics) is being appended to the PDF you are getting, and fix that. The fact that some viewers are able to handle the PDF doesn't mean it's a valid PDF.
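
To confirm whether a given file has this problem before it reaches the parser, a check along the following lines may help. It is only a minimal sketch using plain Java, not PDFBox API: the class name TrailingDataCheck and its methods are hypothetical, and it assumes the whole file fits in memory. It looks for the last %%EOF marker and reports whether anything other than PDF whitespace follows it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
public class TrailingDataCheck {
    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset of the last "%%EOF" marker, or -1 if there is none. */
    public static int lastEofIndex(byte[] data) {
        for (int i = data.length - EOF_MARKER.length; i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < EOF_MARKER.length; j++) {
                if (data[i + j] != EOF_MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** PDF whitespace characters: NUL, TAB, LF, FF, CR and SPACE. */
    private static boolean isPdfWhitespace(byte b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
                || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /**
     * Returns true if the data contains a "%%EOF" marker and the last one
     * is followed by nothing except whitespace.
     */
    public static boolean endsWithEofMarker(byte[] data) {
        int idx = lastEofIndex(data);
        if (idx < 0) {
            return false; // no marker at all
        }
        for (int i = idx + EOF_MARKER.length; i < data.length; i++) {
            if (!isPdfWhitespace(data[i])) {
                return false; // something was appended after %%EOF
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TrailingDataCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(endsWithEofMarker(data)
                ? "file ends with %%EOF"
                : "data found after the last %%EOF, or no %%EOF marker at all");
    }
}
```

Running it against the file as downloaded and against the file as stored on the server should show where the script gets appended.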

With kind regards

Maruan Sahyoun
                
> Javascript at the end of the PDF document fails parsing
> -------------------------------------------------------
>
>                 Key: PDFBOX-1555
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1555
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDFReader
>    Affects Versions: 1.8.0
>            Reporter: Jinder Aujla
>
> Hi
> I was investigating a parsing failure and debugging the PDFBox code when I noticed that the PDF document (which I can't forward) has this at the end of the file:
> %%EOF^M
> ^M
> ^M
> <script type="text/javascript">^M
> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");^M
> document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));^M
> </script>^M
> <script type="text/javascript">^M
> try {^M
> var pageTracker = _gat._getTracker("UA-7429935-1");^M
> pageTracker._trackPageview();^M
> } catch(err) {}</script>^M
> ^M
> ^M
> So the document ends, but there is more content after it: some JavaScript. The parser gets to
> line 492 in org.apache.pdfbox.pdfparser.PDFParser, where
> isEndOfFile gets set to true, but because it's not the end of the actual stream, it continues (this was a fix in PDFBOX-979).
> Next time around the loop it reads
> <script type="text/javascript">
> which I think it ignores, then tries to read
> var
> twice as a number, then blows up. I've been playing around, thinking of a sensible thing to do, but I'm worried that I might introduce some other issue. I assume this is a legal structure for a PDF document. It opens fine in a viewer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira