Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2013/03/13 19:44:14 UTC
[jira] [Closed] (PDFBOX-104) NullPointerException from using PDFTextStripper w/ some URLs

     [ https://issues.apache.org/jira/browse/PDFBOX-104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-104.
-------------------------------------

    Resolution: Won't Fix
      Assignee: Andreas Lehmkühler

The link is no longer valid, so we no longer have a sample PDF to reproduce the problem.

Setting to closed.
                
> NullPointerException from using PDFTextStripper w/ some URLs
> ------------------------------------------------------------
>
>                 Key: PDFBOX-104
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-104
>             Project: PDFBox
>          Issue Type: Bug
>            Assignee: Andreas Lehmkühler
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1335941
> Originally submitted by pkorn78 on 2005-10-23 21:51.
> Hi,
> I've tried to extract text from this URL:
> http://www.ischool.washington.edu/mcdonald/papers/Lutters.AMCIS00.pdf
> However, it throws java.lang.NullPointerException on me
> although the PDF file does exist on that URL.
> Here's the code where I invoke the PDFTextStripper:
> String docURL =
> "http://www.ischool.washington.edu/mcdonald/papers/Lutters.AMCIS00.pdf";
> try{
>  PDFTextStripper stripper = new PDFTextStripper();
>  PDDocument pdfDoc =
> PDDocument.load(getInputStreamFromURL(docURL));
>  String strippedText = stripper.getText(pdfDoc);
>  pdfDoc.close();
>  return strippedText;
> }
> catch(IOException e){
>  e.printStackTrace();
> }
> Here's getInputStreamFromURL method used in the above code:
> public InputStream getInputStreamFromURL(String url){
>  try{
>   URL myURL = new URL(url);
>   URLConnection con = myURL.openConnection();
>   con.setRequestProperty("User-Agent",""); 
>   return con.getInputStream();
>  }
>  catch(MalformedURLException e){
>   e.printStackTrace();
>  }
>  catch(IOException ie){
>   ie.printStackTrace();
>  }
>  return null;
> }
> Please advise, Thanks
> -Palakorn
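
Editorial sketch (not part of the original report): independently of the PDF itself, the getInputStreamFromURL helper quoted above returns null when the URL is malformed or the connection fails, and a null stream handed to the loader could also surface as a NullPointerException. A minimal variant, assuming the caller is prepared to handle IOException, lets the exception propagate instead. The class name UrlStreams and the method name open are hypothetical; only java.net and java.io calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of PDFBox.
final class UrlStreams {

    // Opens a stream for the given URL and never returns null:
    // a malformed URL or a failed connection is reported as an IOException.
    // URL(String) is deprecated in recent JDKs; it is kept to mirror the report.
    @SuppressWarnings("deprecation")
    static InputStream open(String url) throws IOException {
        URL myURL = new URL(url); // MalformedURLException is an IOException
        URLConnection con = myURL.openConnection();
        con.setRequestProperty("User-Agent", "");
        return con.getInputStream();
    }

    public static void main(String[] args) {
        try (InputStream in = open("not-a-url")) {
            System.out.println("opened stream");
        } catch (IOException e) {
            // The failure is visible here instead of turning into a null later.
            System.out.println("could not open: " + e.getMessage());
        }
    }
}
```

With this shape, the existing catch(IOException e) block in the calling code would see connection problems directly, which makes it easier to tell a download failure apart from a parsing failure.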
> [comment on SourceForge]
> Originally sent by brzrkr.
> Logged In: YES 
> user_id=1489602
> I am chasing a similar problem which might be related.
> What I found (using PDFDebugger.java) was that the PDF
> has an invalid PageTree: the root node and 5 of the 6
> Page objects are valid, but the 6th is a Stream, and 
> not a Dictionary(Page).  The offending code is:
>   private static COSArray 
>   org.pdfbox.pdmodel.PDPageNode.getAllKids( 
>       List result, COSDictionary page, boolean recurse
>   );
> which needs to check that kids is non-null before
> using it.
>     
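
Editorial sketch (not part of the original thread): the guard suggested in the comment above can be illustrated without PDFBox. The code below is not the PDPageNode.getAllKids implementation; it models dictionaries as plain java.util.Map objects and the kids array as a java.util.List, and every name in it (PageTreeSketch, collectPages) is hypothetical. It only shows the idea of checking that a node is a dictionary and that its kids entry is present before descending.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a page-tree walk; Map plays the role of a dictionary.
final class PageTreeSketch {

    // Adds every leaf page reachable from node to result.
    // Nodes that are not dictionaries (for example a stream where a page
    // was expected) and nodes without a usable "Kids" list are skipped
    // rather than dereferenced.
    static void collectPages(Object node, List<Map<String, Object>> result) {
        if (!(node instanceof Map)) {
            return; // not a dictionary: ignore the malformed entry
        }
        @SuppressWarnings("unchecked")
        Map<String, Object> dict = (Map<String, Object>) node;
        Object kids = dict.get("Kids");
        if (kids instanceof List) {
            for (Object kid : (List<?>) kids) {
                collectPages(kid, result);
            }
        } else if ("Page".equals(dict.get("Type"))) {
            result.add(dict); // leaf page
        }
        // Anything else is a malformed node and is ignored.
    }

    public static void main(String[] args) {
        Map<String, Object> page1 = Map.of("Type", "Page");
        Map<String, Object> page2 = Map.of("Type", "Page");
        Object notADictionary = "stream"; // stands in for the invalid 6th entry
        Map<String, Object> root =
                Map.of("Type", "Pages", "Kids", List.of(page1, notADictionary, page2));

        List<Map<String, Object>> pages = new ArrayList<>();
        collectPages(root, pages);
        System.out.println(pages.size()); // prints 2
    }
}
```

Whether a real parser should skip such an entry silently or report it is a design choice; the sketch skips it only to keep the example short.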

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira