Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2013/03/13 19:44:14 UTC
[jira] [Closed] (PDFBOX-104) NullPointerException from using PDFTextStripper w/ some URLs
[ https://issues.apache.org/jira/browse/PDFBOX-104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler closed PDFBOX-104.
-------------------------------------
Resolution: Won't Fix
Assignee: Andreas Lehmkühler
The link is no longer valid, so we don't have a sample PDF to reproduce the problem.
Setting to closed.
> NullPointerException from using PDFTextStripper w/ some URLs
> ------------------------------------------------------------
>
> Key: PDFBOX-104
> URL: https://issues.apache.org/jira/browse/PDFBOX-104
> Project: PDFBox
> Issue Type: Bug
> Assignee: Andreas Lehmkühler
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1335941
> Originally submitted by pkorn78 on 2005-10-23 21:51.
> Hi,
> I've tried to extract text from this URL:
> http://www.ischool.washington.edu/mcdonald/papers/Lutters.AMCIS00.pdf
> However, it throws java.lang.NullPointerException on me
> although the PDF file does exist on that URL.
> Here's the code where I invoke the PDFTextStripper:
> String docURL =
>     "http://www.ischool.washington.edu/mcdonald/papers/Lutters.AMCIS00.pdf";
> try {
>     PDFTextStripper stripper = new PDFTextStripper();
>     PDDocument pdfDoc =
>         PDDocument.load(getInputStreamFromURL(docURL));
>     String strippedText = stripper.getText(pdfDoc);
>     pdfDoc.close();
>     return strippedText;
> }
> catch (IOException e) {
>     e.printStackTrace();
> }
> Here's the getInputStreamFromURL method used in the above code:
> public InputStream getInputStreamFromURL(String url) {
>     try {
>         URL myURL = new URL(url);
>         URLConnection con = myURL.openConnection();
>         con.setRequestProperty("User-Agent", "");
>         return con.getInputStream();
>     }
>     catch (MalformedURLException e) {
>         e.printStackTrace();
>     }
>     catch (IOException ie) {
>         ie.printStackTrace();
>     }
>     return null;
> }
> Please advise, Thanks
> -Palakorn
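
The helper quoted above returns null whenever the URL is malformed or the connection fails, so the caller would pass a null stream on to the PDF loader. Nothing in this thread establishes that this caused the reported exception (the comment that follows points at the page tree instead), but a variant that propagates the IOException makes a failed download easier to tell apart from a parsing problem. This is a minimal sketch using only java.net and java.io; the class name UrlStreams and the method name openStream are hypothetical and are not part of PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of PDFBox.
final class UrlStreams {
    private UrlStreams() {}

    // Opens the URL and returns its stream, or throws IOException.
    // Unlike the quoted helper, this never returns null.
    static InputStream openStream(String url) throws IOException {
        // MalformedURLException is a subclass of IOException,
        // so a bad URL is reported to the caller as well.
        URL parsed = new URL(url);
        URLConnection con = parsed.openConnection();
        con.setRequestProperty("User-Agent", "");
        return con.getInputStream();
    }
}
```

A caller would typically wrap the returned stream in try-with-resources so that it is closed even if text extraction fails.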
> [comment on SourceForge]
> Originally sent by brzrkr.
> Logged In: YES
> user_id=1489602
> I am chasing a similar problem which might be related.
> What I found (using PDFDebugger.java) was that the PDF
> has an invalid PageTree: the root node and 5 of the 6
> Page objects are valid, but the 6th is a Stream, and
> not a Dictionary(Page). The offending code is:
> private static COSArray
> org.pdfbox.pdmodel.PDPageNode.getAllKids(
> List result, COSDictionary page, boolean recurse
> );
> which needs to check that kids is non-null before
> using it.
>
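
The defensive check suggested in the comment above can be sketched without PDFBox. The code below is not the actual PDPageNode.getAllKids implementation: it uses plain java.util maps and lists as stand-ins for COSDictionary and COSArray, and the class name PageTreeSketch and method name collectPages are hypothetical. It only illustrates the idea of checking that the Kids entry exists, and that each kid is a dictionary, before using either.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a page-tree walk; Map plays the role of a
// dictionary and List the role of an array. Not PDFBox code.
final class PageTreeSketch {
    private PageTreeSketch() {}

    // Adds every node whose Type is "Page" to result, walking Kids recursively.
    static void collectPages(List<Map<String, Object>> result,
                             Map<String, Object> node) {
        if ("Page".equals(node.get("Type"))) {
            result.add(node);
            return;
        }
        Object kids = node.get("Kids");
        if (!(kids instanceof List<?>)) {
            // Kids is missing or not an array: nothing to traverse.
            return;
        }
        for (Object kid : (List<?>) kids) {
            if (kid instanceof Map<?, ?>) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) kid;
                collectPages(result, child);
            }
            // Any other kind of object (for example a stream where a
            // page dictionary was expected) is skipped.
        }
    }
}
```

Silently skipping a malformed kid is one design choice; logging it or throwing a descriptive IOException are alternatives, depending on how strict the parser should be.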
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira