Posted to users@pdfbox.apache.org by Declan Newman <go...@gmail.com> on 2009/01/13 14:28:16 UTC

NullPointerException on org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode)

Hello everyone,

I'm new, so please be gentle with me.

We are using PDFBox to extract text from a large number of PDFs (approx. 
80,000) in preparation for indexing in Solr/Lucene.

To do this, we call 
org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages() to iterate over 
the pages and strip the contents with the PDFTextStripper one page at a 
time.

The vast majority are fine, but approx. 0.8% fail with a 
NullPointerException at 
org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102).

I'm currently working from the trunk after seeing a similar problem in 
the archives 
(<http://mail-archives.apache.org/mod_mbox/incubator-pdfbox-dev/200809.mbox/%3COF15421546.54F415DC-ON862574BA.006A9E36-862574BA.006AD0A8@uscmail.uscourts.gov%3E>) 
but unfortunately it hasn't solved the issue.

The stack trace is:

Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:102)
    at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:754)
    at com.semantico.depp.extractor.PDFBoxPdfExtractor.writeText(PDFBoxPdfExtractor.java:71)
    at com.semantico.depp.extractor.PDFBoxPdfExtractor.extractText(PDFBoxPdfExtractor.java:56)
    at com.semantico.depp.task.JobTask.doJob(JobTask.java:129)

Having delved into the code, I found that the "page" variable is null when

((COSNumber)page.getDictionaryObject( COSName.COUNT )).intValue()

is called in PDPageNode.getCount(PDPageNode).
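
Until the root cause is found, one way to keep a large batch moving is to isolate the failure per document, so that the roughly 0.8% of files that throw are recorded and skipped rather than aborting the run. The following is only a minimal sketch under stated assumptions: BatchExtractor and the extract function are hypothetical stand-ins for the PDFBox-based extraction step, and none of these names come from PDFBox or from the extractor in the stack trace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical batch wrapper: records failing documents instead of aborting. */
class BatchExtractor {
    /** Extracted text keyed by document path, in processing order. */
    final Map<String, String> texts = new LinkedHashMap<>();
    /** Paths whose extraction threw an unchecked exception. */
    final List<String> failed = new ArrayList<>();

    /**
     * Applies the (hypothetical) extract function to every path.
     * Any RuntimeException, including a NullPointerException such as the
     * one reported from getNumberOfPages(), marks that path as failed.
     */
    void run(List<String> paths, Function<String, String> extract) {
        for (String path : paths) {
            try {
                texts.put(path, extract.apply(path));
            } catch (RuntimeException e) {
                failed.add(path); // keep going; inspect these files later
            }
        }
    }

    public static void main(String[] args) {
        BatchExtractor batch = new BatchExtractor();
        // Simulated extractor: one path fails the way a malformed PDF might.
        batch.run(List.of("good.pdf", "broken.pdf"), path -> {
            if (path.startsWith("broken")) {
                throw new NullPointerException("simulated null page node");
            }
            return "text of " + path;
        });
        System.out.println("extracted=" + batch.texts.keySet()
                + " failed=" + batch.failed);
    }
}
```

Collecting the failed paths in one list also gives a ready-made set of sample files to attach to a bug report.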

I understand that not all PDFs can be supported, and to be honest I 
think 99.2% is amazing. I just thought I would post this in the hopes 
that someone has come across it before.

Thanks for any help.

Regards,

Declan

RE: NullPointerException on org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode)

Posted by Pe...@ibi.com.
I ran PMD and FindBugs (from the University of Maryland) to see whether
they could find a problem in the file PDPageNode.java that you
reported.

Unfortunately, they didn't find anything that would explain your issue.
They only reported an insignificant issue in this file that has nothing
to do with your problem.

I am posting it anyway, in case someone who makes this change happens to
spot your problem at the same time. It would be a small improvement in
memory management when converting large numbers of documents. It is not
a major issue at all.

Below are the comments from FindBugs. PMD reported the same problem, but
FindBugs gives a better description of the issue.
-------------------------------------------------------------------
[M P Bx] Method invokes inefficient Number constructor; use static
valueOf instead [DM_NUMBER_CTOR]

Using new Integer(int) is guaranteed to always result in a new object
whereas Integer.valueOf(int) allows caching of values to be done by the
compiler, class library, or JVM. Using of cached values avoids object
allocation and the code will be faster. 

Values between -128 and 127 are guaranteed to have corresponding cached
instances and using valueOf is approximately 3.5 times faster than using
constructor. For values outside the constant range the performance of
both styles is the same. 

Unless the class must be compatible with JVMs predating Java 1.5, use
either autoboxing or the valueOf() method when creating instances of
Long, Integer, Short, Character, and Byte.


    public Integer getRotation()
    {
        Integer retval = null;
        COSNumber value = (COSNumber)page.getDictionaryObject( COSName.ROTATE );
        if( value != null )
        {
            // Change this:
            //     retval = new Integer( value.intValue() );
            // to this, so that values from -128 to 127 reuse cached
            // Integer instances:
            retval = Integer.valueOf( value.intValue() );
        }
        return retval;
    }
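
To make the caching behaviour in the FindBugs description concrete, here is a small sketch that checks whether two Integer.valueOf calls for the same int return the same object. The class and method names are made up for illustration. Only the range -128 to 127 is guaranteed to be cached; since PDF rotation values are multiples of 90, that covers 0 and 90, while 180 and 270 depend on the JVM's cache settings.

```java
/** Illustrative check of the Integer.valueOf cache (names are hypothetical). */
class ValueOfCacheDemo {
    /** Returns true when two valueOf calls for n yield the same object. */
    static boolean sharesInstance(int n) {
        Integer a = Integer.valueOf(n);
        Integer b = Integer.valueOf(n);
        return a == b; // identity comparison on purpose, not equals()
    }

    public static void main(String[] args) {
        // Common page rotation values; results for 180 and 270 are JVM-dependent.
        for (int rotation : new int[] {0, 90, 180, 270}) {
            System.out.println(rotation + " shares instance: "
                    + sharesInstance(rotation));
        }
    }
}
```

Either way the returned values are equal under equals(), so callers of getRotation() would not see a behavioural difference, only fewer allocations for small values.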

