Posted to dev@pdfbox.apache.org by reinhard schwab <re...@aon.at> on 2010/08/21 20:47:04 UTC

NPE in PDPageNode

I get a NullPointerException when parsing a PDF with Tika.

http://www.awsg.at/portal/media/4218.pdf

java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
    at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:943)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:105)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:86)


regards
reinhard




Re: NPE in PDPageNode

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,


On 24.08.2010 02:41, Adam@swmc.com wrote:
> Reinhard,
>
> If you can get a copy of the unencrypted version, that'd be very helpful.
> If not, we'll just do the best we can with the PDF you have provided.  I
> tried removing the password with a program I have, but it seems to have
> run into the same issue with parsing the PDF as PDFBox did... so no luck
> there either.
The problem is the unsupported encryption algorithm. See my comment on [1]. (I
don't know why there wasn't any notification on dev@.)

BR
Andreas Lehmkühler
[1] https://issues.apache.org/jira/browse/PDFBOX-797

Re: NPE in PDPageNode

Posted by Ad...@swmc.com.
Reinhard,

If you can get a copy of the unencrypted version, that'd be very helpful. 
If not, we'll just do the best we can with the PDF you have provided.  I 
tried removing the password with a program I have, but it seems to have 
run into the same issue with parsing the PDF as PDFBox did... so no luck 
there either.

---- 
Thanks,
Adam





On 08/23/2010 13:24, reinhard schwab <re...@aon.at> wrote to dev@pdfbox.apache.org
(Subject: Re: NPE in PDPageNode):
> Adam,
>
> I'm sorry. I don't know what program was used, nor do I know the password
> or how to remove the encryption.
> I can only ask some other people about this.
> I will open a JIRA issue and attach the file.
>
> best regards
> reinhard

This email and any content within or attached hereto from  Sun West Mortgage Company, Inc.  is confidential and/or legally privileged. The information is intended only for the use of the individual or entity named on this email. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or the taking of any action in reliance on the contents of this email information is strictly prohibited, and that the documents should be returned to this office immediately by email. Receipt by anyone other than the intended recipient is not a waiver of any privilege. Please do not include your social security number, account number, or any other personal or financial information in the content of the email. Should you have any questions, please call  (800) 453 7884.   

Re: NPE in PDPageNode

Posted by reinhard schwab <re...@aon.at>.
Adam,

I'm sorry. I don't know what program was used, nor do I know the password
or how to remove the encryption.
I can only ask some other people about this.
I will open a JIRA issue and attach the file.

best regards
reinhard



Re: NPE in PDPageNode

Posted by Ad...@swmc.com.
Reinhard,

The root element in your PDF references object 1554 as the object that
describes the pages within this document.  This object does not seem to
exist in the PDF, which is a violation of the PDF spec and the reason
PDFBox is unable to parse it.  You can open the PDF in a decent text editor
and search for 1554: you'll see the Pages entry which references this
object, but that's the only place it's found. There's no object
definition.
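
The same search can be scripted instead of done by hand in a text editor.
What follows is only a rough sketch using the Java standard library, not
PDFBox code; the class name PdfObjectScan and its two methods are made up
for illustration. It looks in the raw file bytes for a definition of the
form "N G obj" and a reference of the form "N G R". One caveat: since
PDF 1.5, objects may be stored inside compressed object streams, where a
plain text search cannot see them, so a "defined: false" result suggests
a missing object but does not prove it.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox: searches raw PDF text for an object number. */
final class PdfObjectScan {

    /** True if the text contains a definition such as "1554 0 obj". */
    static boolean isDefined(String pdfText, int objectNumber) {
        // The lookbehind keeps "21554 0 obj" from matching object 1554.
        return Pattern.compile("(?<![0-9])" + objectNumber + "\\s+\\d+\\s+obj\\b")
                .matcher(pdfText).find();
    }

    /** True if the text contains an indirect reference such as "1554 0 R". */
    static boolean isReferenced(String pdfText, int objectNumber) {
        return Pattern.compile("(?<![0-9])" + objectNumber + "\\s+\\d+\\s+R\\b")
                .matcher(pdfText).find();
    }

    public static void main(String[] args) throws Exception {
        // ISO-8859-1 maps every byte to exactly one char, so binary stream data cannot break decoding.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.ISO_8859_1);
        int objectNumber = Integer.parseInt(args[1]);
        System.out.println("referenced: " + isReferenced(text, objectNumber));
        System.out.println("defined:    " + isDefined(text, objectNumber));
    }
}
```

With Java 11 or later this can be run directly against a local copy of the
file, for example: java PdfObjectScan.java 4218.pdf 1554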

Now, having said that, if we can find a reliable way to parse files like 
these, we can update the code.  Do you know what program was used to 
create this PDF?  Would it be possible for you to remove the encryption on 
this file and try it again?  That would make it much easier to debug (if 
it still crashes without the encryption, it might not).
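
One possible direction, as a sketch only: report a descriptive exception
instead of a NullPointerException when the page tree root cannot be
resolved. The trace does not show what is actually null at
PDPageNode.java:109, so the class below (PageTreeSketch) is a hypothetical
stand-in and not PDFBox's PDPageNode; a plain Map stands in for the
resolved /Pages dictionary, and null stands in for a reference that did
not resolve.

```java
import java.io.IOException;
import java.util.Map;

/** Hypothetical stand-in for a page tree node; this is not PDFBox's PDPageNode. */
final class PageTreeSketch {

    /** Resolved /Pages dictionary, or null if the reference did not resolve (assumption). */
    private final Map<String, Object> pagesDictionary;

    PageTreeSketch(Map<String, Object> pagesDictionary) {
        this.pagesDictionary = pagesDictionary;
    }

    /** Returns the /Count entry, or throws an IOException with a readable message. */
    int getCount() throws IOException {
        if (pagesDictionary == null) {
            throw new IOException(
                    "Page tree root could not be resolved (missing or undecodable /Pages object)");
        }
        Object count = pagesDictionary.get("Count");
        if (!(count instanceof Number)) {
            throw new IOException("Page tree root has no numeric /Count entry");
        }
        return ((Number) count).intValue();
    }
}
```

A caller such as Tika would then see an IOException that names the broken
structure, rather than an NPE from deep inside the page count lookup.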

I also encourage you to create an issue in JIRA and upload this file there
(in case the link dies in the future).  https://issues.apache.org/jira

---- 
Thanks,
Adam




