Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2008/11/11 23:29:45 UTC
[jira] Updated: (PDFBOX-318) Error getting pdf version
[ https://issues.apache.org/jira/browse/PDFBOX-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-318:
--------------------------------------
Attachment: patch_pdfbox-318.diff
Apparently some files have a malformed header. I've written a patch that makes extracting the version of a PDF document more robust by adding a new method to the class org.apache.pdfbox.pdfparser.PDFParser. With this patch, the document mentioned in the issue description is parsed without problems.
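For illustration only, and not the contents of the attached patch: a minimal sketch of tolerant version extraction, assuming the first bytes of the file are available as a byte array. The class name PdfHeaderSketch, both method names, and the scan limit parameter are hypothetical and are not part of the PDFBox API. The sketch searches for the "%PDF-" marker anywhere inside a bounded window instead of expecting it at offset 0, and reports a malformed version with a sentinel value rather than a NumberFormatException.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of PDFBox.
final class PdfHeaderSketch {
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset of "%PDF-" within the first 'limit' bytes, or -1 if absent. */
    static int findHeaderOffset(byte[] data, int limit) {
        int end = Math.min(data.length, limit) - MARKER.length;
        for (int i = 0; i <= end; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (data[i + j] != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    private static boolean isAsciiDigit(byte b) {
        return b >= '0' && b <= '9';
    }

    /**
     * Parses a "digit.digit" version following the marker.
     * Returns -1 if the marker is missing or the version is malformed.
     */
    static float parseVersion(byte[] data, int limit) {
        int start = findHeaderOffset(data, limit);
        if (start < 0) {
            return -1f;
        }
        int pos = start + MARKER.length;
        if (pos + 3 > data.length) {
            return -1f; // not enough bytes for "d.d"
        }
        byte major = data[pos];
        byte dot = data[pos + 1];
        byte minor = data[pos + 2];
        if (!isAsciiDigit(major) || dot != '.' || !isAsciiDigit(minor)) {
            return -1f; // e.g. "%PDF--" ends up here instead of throwing
        }
        return Float.parseFloat("" + (char) major + '.' + (char) minor);
    }
}
```

For a buffer that holds 0x80 bytes of garbage followed by "%PDF-1.3", as in the hex dump quoted below, findHeaderOffset returns 0x80. A real parser would additionally need to reposition its input stream to that offset before reading objects.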
> Error getting pdf version
> -------------------------
>
> Key: PDFBOX-318
> URL: https://issues.apache.org/jira/browse/PDFBOX-318
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Attachments: patch_pdfbox-318.diff
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1822452
> Originally submitted by nobody on 2007-10-29 17:37.
> java.io.IOException: Error getting pdf version:java.lang.NumberFormatException: For input string: "-"
> at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:166)
> at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
> at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
> at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:633)
> at test.pdfbox.pdfparser.TestPDFParser.test_exception_version1(TestPDFParser.java:112)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at junit.framework.TestCase.runTest(TestCase.java:154)
> at junit.framework.TestCase.runBare(TestCase.java:127)
> at junit.framework.TestResult$1.protect(TestResult.java:106)
> at junit.framework.TestResult.runProtected(TestResult.java:124)
> at junit.framework.TestResult.run(TestResult.java:109)
> at junit.framework.TestCase.run(TestCase.java:118)
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1822452&file_id=251894
> exception_version1.pdf (application/pdf), 196864 bytes
> [comment on SourceForge]
> Originally sent by nobody.
> Logged In: NO
> Someone can put a better, more thoughtful fix in.
> Here is what I did to fix it.
> PDFParser.java:
> public void parse() throws IOException
> {
>     try
>     {
>         if ( raf == null )
>         {
>             checktmpDir();
>             document = new COSDocument( tempDirectory );
>         }
>         else
>         {
>             document = new COSDocument( raf );
>         }
>         setDocument( document );
>         findVersion(); // New method see below.
>         // Code to find version moved to method findVersion();
>         skipHeaderFillBytes();
>         Object nextObject;
> [...]
> ----
> /**
>  * Attempt to find version in the following form %PDF-<number><0a|0d>%
>  * @throws IOException
>  */
> private void findVersion() throws IOException
> {
>     String header = null;
>     // try 5 lines to get PDF Version.
>     for ( int i = 0; i < 5; i++) {
>         header = readLine();
>
>         //sometimes there are some garbage bytes in the header before the header
>         //actually starts, so lets try to find the header first.
>         int headerStart = header.indexOf( PDF_HEADER );
>         //greater than zero because if it is zero then
>         //there is no point of trimming
>         if( headerStart > 0 )
>         {
>             //trim off any leading characters
>             header = header.substring( headerStart, header.length() );
>         } else if (headerStart < 0)
>             continue; // Did not find the Header Go look at next line
>
>         document.setHeaderString( header );
>         try
>         {
>             float pdfVersion = Float.parseFloat(
>                 header.substring( PDF_HEADER.length(), Math.min( header.length(), PDF_HEADER.length()+3) ) );
>             document.setVersion( pdfVersion );
>             return; // Express return.
>         }
>         catch( NumberFormatException e )
>         {
>             throw new IOException( "Error getting pdf version: " + header + "\n" + e );
>         }
>     }
>     throw new IOException( "Unable to find version");
> }
> ----
> [comment on SourceForge]
> Originally sent by nobody.
> Logged In: NO
> Debugged it with a hex dump on the submitted file
> ---
> It appears that the version header starts at offset 0x80 instead of on the first line.
> Adobe Reader 7.x appears to have skipped ahead to the version header and displayed the rest properly.
> So I think something needs to be done with PDFParser::parse() version checking.
> 00000000: 001f 3339 3339 202d 2057 4648 202d 2050 ..3939 - WFH - P
> 00000010: 7265 7020 666f 2331 3533 3245 332e 7064 rep fo#1532E3.pd
> 00000020: 6600 0000 0000 0000 0000 0000 0000 0000 f...............
> 00000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 00000050: 0000 0000 0300 2100 0000 00c2 550d 05c2 ......!.....U...
> 00000060: 550d 0500 0000 0000 0000 0000 0000 0000 U...............
> 00000070: 0000 0000 0000 0000 0000 8181 af49 0000 .............I..
> 00000080: 2550 4446 2d31 2e33 0a25 c4e5 f2e5 eba7 %PDF-1.3.%......
> 00000090: f3a0 d0c4 c60a 3220 3020 6f62 6a0a 3c3c ......2 0 obj.<<
> [comment on SourceForge]
> Originally sent by nobody.
> Logged In: NO
> Tested on 0.7.2, 0.7.3, latest 0.7.4-2007-10-22
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.