Posted to dev@pdfbox.apache.org by "Sean Bridges (JIRA)" <ji...@apache.org> on 2009/05/12 21:18:45 UTC

[jira] Created: (PDFBOX-466) error parsing files generated by crystal reports

error parsing files generated by crystal reports
------------------------------------------------

                 Key: PDFBOX-466
                 URL: https://issues.apache.org/jira/browse/PDFBOX-466
             Project: PDFBox
          Issue Type: Bug
          Components: FontBox
            Reporter: Sean Bridges
             Fix For: 0.8.0-incubator


This is with the latest from svn, Revision: 773978

From a sample of 13304 PDF documents generated in a very wide variety of ways, I got 200 exceptions with the following stack trace:

Caused by: java.io.IOException: expected='obj' actual='000' org.apache.pdfbox.io.PushBackInputStream@1049d3
	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:471)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:169)
	at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:32)
	... 2 more

I can't provide an example file, but the failing PDFs were all generated by Crystal Reports.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PDFBOX-466) error parsing files generated by crystal reports

Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708594#action_12708594 ] 

Sean Bridges commented on PDFBOX-466:
-------------------------------------

Looking at one of the PDFs, it ends with:

<< 
/Producer (Powered By Crystal)  
/Creator (Crystal Reports)  
>> 
endobj 
xref 
0 36 
0000000000 65535 f 
0000000017 00000 n 
0000037961 00000 n 
0000038060 00000 n 
0000038094 00000 n 
0000000194 00000 n 
0000038128 00000 n 
0000038250 00000 n 
0000038308 00000 n 
0000038400 00000 n 
0000055457 00000 n 
0000055511 00000 n 
0000056340 00000 n 
0000056516 00000 n 
0000056692 00000 n 
0000056868 00000 n 
0000057217 00000 n 
0000000823 00000 n 
0000057256 00000 n 
0000057524 00000 n 
0000001348 00000 n 
0000057567 00000 n 
0000057891 00000 n 
0000009425 00000 n 
0000057924 00000 n 
0000058191 00000 n 
0000009867 00000 n 
0000058234 00000 n 
0000058603 00000 n 
0000021478 00000 n 
0000058641 00000 n 
0000058908 00000 n 
0000022076 00000 n 
0000058951 00000 n 
0000058991 00000 n 
0000059028 00000 n 
trailer 
<< 
/Size 36 
/Root 1 0 R 
/Info 35 0 R 
>> 
startxref 
59116 
%%EOF 


The exception is thrown after the parser reads the "0 36" that follows xref.  The line

objectKey = readString( 3 );

then reads "000" instead of "obj", and the exception is thrown.
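The misparse described above can be reproduced outside PDFBox with a small sketch. This is not PDFBox's code: the class XrefMisparseDemo and its helpers are hypothetical, and it assumes that a parser which did not recognize the xref keyword goes on to read what follows as "<object number> <generation> obj".

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the object-header parsing step; not PDFBox code.
final class XrefMisparseDemo {

    // Skip leading whitespace and return the first non-whitespace character (or -1).
    private static int skipWhitespace(PushbackReader in) throws IOException {
        int c = in.read();
        while (c != -1 && Character.isWhitespace(c)) {
            c = in.read();
        }
        return c;
    }

    // Skip whitespace, then read a run of decimal digits as an int.
    static int readInt(PushbackReader in) throws IOException {
        int c = skipWhitespace(in);
        StringBuilder digits = new StringBuilder();
        while (c != -1 && Character.isDigit(c)) {
            digits.append((char) c);
            c = in.read();
        }
        if (c != -1) {
            in.unread(c); // put back the character that ended the number
        }
        return Integer.parseInt(digits.toString());
    }

    // Skip whitespace, then read up to 'length' characters.
    static String readString(PushbackReader in, int length) throws IOException {
        int c = skipWhitespace(in);
        StringBuilder sb = new StringBuilder();
        while (c != -1 && sb.length() < length) {
            sb.append((char) c);
            if (sb.length() < length) {
                c = in.read();
            }
        }
        return sb.toString();
    }

    // Read "<number> <generation>" and return the 3 characters found where "obj" is expected.
    static String tokenWhereObjExpected(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            readInt(in); // taken as the object number
            readInt(in); // taken as the generation number
            return readString(in, 3);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The text that follows "xref " in the sample file above.
        String afterXref = "0 36 \n0000000000 65535 f \n";
        System.out.println("expected='obj' actual='" + tokenWhereObjExpected(afterXref) + "'");
    }
}
```

With the sample input, "0" and "36" are consumed as object and generation numbers, so the next three characters come from the first table entry, 0000000000.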



[jira] Updated: (PDFBOX-466) error parsing files generated by crystal reports

Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Bridges updated PDFBOX-466:
--------------------------------

    Attachment: patch

This patch fixes the issue.  Crystal Reports adds a space after xref and startxref.  Calling trim() on the lines before comparison makes it work.
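The attached patch itself is not reproduced in this thread, so the following is only a sketch of the idea it describes. The class XrefKeywordMatch and its method names are hypothetical; the point is that an exact comparison rejects a keyword line with trailing whitespace, while a trimmed comparison accepts it.

```java
// Hypothetical illustration of strict vs. trimmed keyword matching; not the actual patch.
final class XrefKeywordMatch {

    // Exact comparison: fails if the generator left trailing whitespace on the line.
    static boolean matchesStrict(String line, String keyword) {
        return line.equals(keyword);
    }

    // Lenient comparison: surrounding whitespace is removed before comparing.
    static boolean matchesTrimmed(String line, String keyword) {
        return line.trim().equals(keyword);
    }

    public static void main(String[] args) {
        String xrefLine = "xref ";           // as written by Crystal Reports
        String startxrefLine = "startxref "; // same trailing space

        System.out.println(matchesStrict(xrefLine, "xref"));            // false
        System.out.println(matchesTrimmed(xrefLine, "xref"));           // true
        System.out.println(matchesTrimmed(startxrefLine, "startxref")); // true
    }
}
```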



[jira] Updated: (PDFBOX-466) error parsing files generated by crystal reports

Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Bridges updated PDFBOX-466:
--------------------------------

    Attachment: patch2

I've discovered some other files with xref sections like this:

endobj
xref 0 127
0000000000 65535 f
0000000017 00000 n
0000000113 00000 n

In this case there is no newline between xref and the first integer.

This patch, applied on top of the previous one, will allow those files to parse as well.
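As a rough sketch of what accepting this layout involves (again, not the attached patch; the class XrefHeaderLine and its method are hypothetical), a parser can split the xref line on whitespace and, when two integers follow the keyword, treat them as the subsection's first object number and entry count.

```java
// Hypothetical sketch: accept "xref" alone or "xref <first> <count>" on one line.
final class XrefHeaderLine {

    // Returns {firstObjectNumber, entryCount} when both follow "xref" on the same line,
    // or null when the line holds only the keyword (the numbers are then on the next line).
    static int[] parseInlineSubsection(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (!tokens[0].equals("xref")) {
            throw new IllegalArgumentException("not an xref line: '" + line + "'");
        }
        if (tokens.length == 1) {
            return null;
        }
        if (tokens.length != 3) {
            throw new IllegalArgumentException("unexpected xref line: '" + line + "'");
        }
        return new int[] { Integer.parseInt(tokens[1]), Integer.parseInt(tokens[2]) };
    }

    public static void main(String[] args) {
        int[] inline = parseInlineSubsection("xref 0 127");
        System.out.println(inline[0] + " " + inline[1]);      // 0 127
        System.out.println(parseInlineSubsection("xref "));   // null
    }
}
```

Trimming before splitting also covers the trailing-space case from the first patch.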



[jira] Commented: (PDFBOX-466) error parsing files generated by crystal reports

Posted by "anybudy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797110#action_12797110 ] 

anybudy commented on PDFBOX-466:
--------------------------------

I am using the .NET version of 0.8.0-incubator (it was hard to collect the required files). Our PDF supplier changed the PDF format, and I need to extract text from PDF files created by Crystal Reports (PDF version 1.6, Acrobat 7.x). I am getting an exception, and I think it is the same problem. Could you please help me with this? I have no Java knowledge.

org.apache.pdfbox.exceptions.WrappedIOException was unhandled
StackTrace:
       at org.apache.pdfbox.pdfparser.PDFParser.parse()
       at org.apache.pdfbox.pdmodel.PDDocument.load(InputStream input, RandomAccess scratchFile)
       at org.apache.pdfbox.pdmodel.PDDocument.load(InputStream input)
       at org.apache.pdfbox.pdmodel.PDDocument.load(String filename)
       at BetMatik.pdfManipulationClass.readPDF(String fileName, String date) in ....


Thank you very much.





[jira] Commented: (PDFBOX-466) error parsing files generated by crystal reports

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717156#action_12717156 ] 

Andreas Lehmkühler commented on PDFBOX-466:
-------------------------------------------

Hi Sean,

You didn't grant the license to the ASF for the attached patch "patch2". Please correct that so that we'll be able to include your changes.

Thanks in advance



[jira] Updated: (PDFBOX-466) error parsing files generated by crystal reports

Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Bridges updated PDFBOX-466:
--------------------------------

    Attachment: patch2_again

As above, but with the grant-for-inclusion radio button checked.



[jira] Resolved: (PDFBOX-466) error parsing files generated by crystal reports

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-466.
---------------------------------------

    Resolution: Fixed

The issue is fixed in revision 783210.

Thanks, Sean, for your contribution.
