Posted to dev@pdfbox.apache.org by "Sean Bridges (JIRA)" <ji...@apache.org> on 2009/05/12 21:18:45 UTC
[jira] Created: (PDFBOX-466) error parsing files generated by crystal reports
error parsing files generated by crystal reports
------------------------------------------------
Key: PDFBOX-466
URL: https://issues.apache.org/jira/browse/PDFBOX-466
Project: PDFBox
Issue Type: Bug
Components: FontBox
Reporter: Sean Bridges
Fix For: 0.8.0-incubator
This is with the latest from svn, Revision: 773978
From a sample of 13304 PDF documents generated in a wide variety of ways, I got 200 exceptions with this stack trace:
Caused by: java.io.IOException: expected='obj' actual='000' org.apache.pdfbox.io.PushBackInputStream@1049d3
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:471)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:169)
at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:32)
... 2 more
I can't give an example file, but the PDFs are all generated by Crystal Reports.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PDFBOX-466) error parsing files generated by crystal reports
Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708594#action_12708594 ]
Sean Bridges commented on PDFBOX-466:
-------------------------------------
Looking at one of the PDFs, it ends with:
<<
/Producer (Powered By Crystal)
/Creator (Crystal Reports)
>>
endobj
xref
0 36
0000000000 65535 f
0000000017 00000 n
0000037961 00000 n
0000038060 00000 n
0000038094 00000 n
0000000194 00000 n
0000038128 00000 n
0000038250 00000 n
0000038308 00000 n
0000038400 00000 n
0000055457 00000 n
0000055511 00000 n
0000056340 00000 n
0000056516 00000 n
0000056692 00000 n
0000056868 00000 n
0000057217 00000 n
0000000823 00000 n
0000057256 00000 n
0000057524 00000 n
0000001348 00000 n
0000057567 00000 n
0000057891 00000 n
0000009425 00000 n
0000057924 00000 n
0000058191 00000 n
0000009867 00000 n
0000058234 00000 n
0000058603 00000 n
0000021478 00000 n
0000058641 00000 n
0000058908 00000 n
0000022076 00000 n
0000058951 00000 n
0000058991 00000 n
0000059028 00000 n
trailer
<<
/Size 36
/Root 1 0 R
/Info 35 0 R
>>
startxref
59116
%%EOF
The exception is thrown after reading the "0 36" that follows xref. The line
objectKey = readString( 3 );
reads "000", which is not "obj", so the exception is thrown.
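To make the failure mode concrete, here is a minimal sketch of what presumably happens once the xref keyword goes unrecognized: the parser treats "0 36" as an object number and a generation number, then reads the next three characters expecting "obj". This is a simplified stand-in, not PDFBox's actual parser; the class ObjKeywordSketch and all of its methods are hypothetical names used only for illustration.

```java
// Hypothetical, simplified stand-in for an object-header reader.
// It is NOT the PDFBox implementation; it only mimics "two numbers, then a keyword".
final class ObjKeywordSketch {
    private final String text;
    private int pos = 0;

    ObjKeywordSketch(String text) {
        this.text = text;
    }

    // Advance past spaces, tabs and line breaks.
    private void skipWhitespace() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    // Read one whitespace-delimited token (for example an object or generation number).
    String readToken() {
        skipWhitespace();
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        return text.substring(start, pos);
    }

    // Read up to `length` characters after skipping whitespace.
    String readString(int length) {
        skipWhitespace();
        int end = Math.min(text.length(), pos + length);
        String s = text.substring(pos, end);
        pos = end;
        return s;
    }

    // Consume two tokens as if they were "<number> <generation>", then return
    // the three characters that a parser would compare against "obj".
    static String keywordAfterTwoNumbers(String text) {
        ObjKeywordSketch in = new ObjKeywordSketch(text);
        in.readToken();
        in.readToken();
        return in.readString(3);
    }

    public static void main(String[] args) {
        // A real object header yields the expected keyword.
        System.out.println(keywordAfterTwoNumbers("12 0 obj\n<<"));           // prints "obj"
        // An xref subsection header mistaken for an object header does not.
        System.out.println(keywordAfterTwoNumbers("0 36\n0000000000 65535 f")); // prints "000"
    }
}
```

The second call reproduces the expected='obj' actual='000' mismatch from the stack trace: the three characters come from the first xref entry, 0000000000.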
[jira] Updated: (PDFBOX-466) error parsing files generated by crystal reports
Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Bridges updated PDFBOX-466:
--------------------------------
Attachment: patch
This patch fixes the issue. Crystal Reports adds a space after xref and startxref. Calling trim() on the lines before comparison makes it work.
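For readers without the attachment, the idea can be sketched as follows. This is not the patch itself; KeywordLineCheck and isKeywordLine are hypothetical names, and the sketch assumes the parser compares a whole line against the keyword.

```java
// Hypothetical illustration of trimming a line before comparing it to a keyword.
final class KeywordLineCheck {

    // True when the line consists of the keyword, ignoring leading/trailing whitespace.
    static boolean isKeywordLine(String line, String keyword) {
        return line != null && line.trim().equals(keyword);
    }

    public static void main(String[] args) {
        // "xref " and "startxref " carry the trailing space described above.
        String[] lines = { "xref ", "xref", "startxref ", "0 36" };
        for (String line : lines) {
            System.out.println("[" + line + "]"
                    + " xref=" + isKeywordLine(line, "xref")
                    + " startxref=" + isKeywordLine(line, "startxref"));
        }
    }
}
```

An exact comparison such as "xref ".equals("xref") is false, which is why the untrimmed line falls through to the object-parsing path.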
[jira] Updated: (PDFBOX-466) error parsing files generated by crystal reports
Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Bridges updated PDFBOX-466:
--------------------------------
Attachment: patch2
I've discovered some other files with xref sections like this:
endobj
xref 0 127
0000000000 65535 f
0000000017 00000 n
0000000113 00000 n
In this case there is no newline between the xref and the first integer.
This patch, applied on top of the previous one, will allow those files to parse as well.
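One way to accept both layouts is to split the header on any whitespace instead of on line breaks. The sketch below shows that approach; it is an assumption about how such tolerance could be written, not the content of patch2, and XrefHeaderSketch and readHeader are hypothetical names.

```java
// Hypothetical illustration: read "xref", the first object number and the entry
// count regardless of whether a space or a newline separates them.
final class XrefHeaderSketch {

    // Returns { firstObjectNumber, entryCount } for the first xref subsection.
    static int[] readHeader(String text) {
        // Any run of spaces, tabs or line breaks counts as a single separator.
        String[] tokens = text.trim().split("\\s+");
        if (tokens.length < 3 || !tokens[0].equals("xref")) {
            throw new IllegalArgumentException("not an xref section");
        }
        return new int[] { Integer.parseInt(tokens[1]), Integer.parseInt(tokens[2]) };
    }

    public static void main(String[] args) {
        // Keyword followed by a space and a newline, as in the first sample.
        int[] a = readHeader("xref \n0 36\n0000000000 65535 f\n");
        // Keyword and numbers on the same line, as in the second sample.
        int[] b = readHeader("xref 0 127\n0000000000 65535 f\n");
        System.out.println(a[0] + " " + a[1]); // prints "0 36"
        System.out.println(b[0] + " " + b[1]); // prints "0 127"
    }
}
```

Treating every whitespace run as a separator also covers the trailing-space case from the first patch, at the cost of being more lenient than a strict line-based reader.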
[jira] Commented: (PDFBOX-466) error parsing files generated by crystal reports
Posted by "anybudy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797110#action_12797110 ]
anybudy commented on PDFBOX-466:
--------------------------------
I am using the .NET version of 0.8.0-incubator (it was hard to collect the required files). Our PDF supplier changed the PDF format, and I need to extract text from PDF files created by Crystal Reports (PDF version 1.6, Acrobat 7.x). I am getting an exception, and I think it may be the same problem. Could you please help me with this? I have no Java knowledge.
org.apache.pdfbox.exceptions.WrappedIOException was unhandled
StackTrace:
at org.apache.pdfbox.pdfparser.PDFParser.parse()
at org.apache.pdfbox.pdmodel.PDDocument.load(InputStream input, RandomAccess scratchFile)
at org.apache.pdfbox.pdmodel.PDDocument.load(InputStream input)
at org.apache.pdfbox.pdmodel.PDDocument.load(String filename)
at BetMatik.pdfManipulationClass.readPDF(String fileName, String date) in ....
Thank you very much.
[jira] Commented: (PDFBOX-466) error parsing files generated by crystal reports
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717156#action_12717156 ]
Andreas Lehmkühler commented on PDFBOX-466:
-------------------------------------------
Hi Sean,
you didn't grant the license to the ASF for the attached patch "patch2". Please correct that so that we'll be able to include your changes.
Thanks in advance
[jira] Updated: (PDFBOX-466) error parsing files generated by crystal reports
Posted by "Sean Bridges (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Bridges updated PDFBOX-466:
--------------------------------
Attachment: patch2_again
As above, but with the grant-for-inclusion radio button checked.
[jira] Resolved: (PDFBOX-466) error parsing files generated by crystal reports
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-466.
---------------------------------------
Resolution: Fixed
The issue is fixed with revision 783210.
Thanks, Sean, for your contribution.