Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 02:45:35 UTC

[jira] [Updated] (PDFBOX-1555) Javascript after %%EOF fails parsing

     [ https://issues.apache.org/jira/browse/PDFBOX-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-1555:
--------------------------------
    Fix Version/s: 2.0.0

> Javascript after %%EOF fails parsing
> ------------------------------------
>
>                 Key: PDFBOX-1555
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1555
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.0
>            Reporter: Jinder Aujla
>             Fix For: 2.0.0
>
>         Attachments: 0001-MA-1981-Analyzer-Production-heitman.com-PDF-attachme.patch, 0002-MA-1981-Analyzer-Production-heitman.com-PDF-attachme.patch
>
>
> Hi,
> I was investigating a parse failure and debugging the PDFBox code when I noticed that the PDF document (which I can't forward) has this at the end of the file:
> %%EOF^M
> ^M
> ^M
> <script type="text/javascript">^M
> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");^M
> document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));^M
> </script>^M
> <script type="text/javascript">^M
> try {^M
> var pageTracker = _gat._getTracker("UA-7429935-1");^M
> pageTracker._trackPageview();^M
> } catch(err) {}</script>^M
> ^M
> ^M
> So the document ends, but there is more content after it: some JavaScript. The parser gets to
> line 492 in org.apache.pdfbox.pdfparser.PDFParser, where
> isEndOfFile gets set to true, but because it's not the end of the actual stream, it continues (this behaviour was a fix in PDFBOX-979).
> The next time around the loop it reads
> <script type="text/javascript">
> which I think it ignores, then tries to read
> var
> twice as a number, and then blows up. I've been playing around, thinking of a sensible thing to do, but I'm worried that I might introduce some other issue. I assume this is a legal structure for a PDF document; it opens fine in a viewer.
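
One possible workaround for the situation described above, sketched here under stated assumptions, is to drop everything after the last %%EOF marker before the bytes are handed to the parser. This is not the fix applied in PDFBox; the class TrailingContentTrimmer and its methods are hypothetical names made up for illustration, and the sketch assumes the whole file fits in memory and that the trailing junk does not itself contain the string %%EOF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: removes bytes that follow the
// last %%EOF marker (for example HTML/JavaScript appended by a web server).
public class TrailingContentTrimmer {

    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.US_ASCII);

    // Returns the index of the last occurrence of %%EOF, or -1 if absent.
    public static int lastEofMarker(byte[] data) {
        for (int i = data.length - EOF_MARKER.length; i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < EOF_MARKER.length; j++) {
                if (data[i + j] != EOF_MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    // Returns a copy of the input that ends right after the last %%EOF.
    // If no marker is found, the input is returned unchanged.
    public static byte[] trimAfterEof(byte[] data) {
        int index = lastEofMarker(data);
        if (index < 0) {
            return data;
        }
        return Arrays.copyOf(data, index + EOF_MARKER.length);
    }
}
```

Searching from the end keeps files with incremental updates intact, since those legitimately contain more than one %%EOF. The trimmed bytes could then be wrapped in a java.io.ByteArrayInputStream and passed to whatever loading call is already in use.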


