Posted to dev@nutch.apache.org by "Richard Braman (JIRA)" <ji...@apache.org> on 2006/03/03 02:25:39 UTC

[jira] Created: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

PDF Box can't parse document: java.lang.NullPointerException
------------------------------------------------------------

         Key: NUTCH-220
         URL: http://issues.apache.org/jira/browse/NUTCH-220
     Project: Nutch
        Type: Bug
 Environment: PDFBox 0.7.2
    Reporter: Richard Braman


This error was fixed in the latest build of PDFBox, which should be tested with Nutch.

>> 060228 160354 fetch okay, but can't parse
>> http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
>> failed(2,0): Can't be handled as pdf document. 
>> java.lang.NullPointerException

Yes, the NPE should be fixed.

 Ben

Richard Braman wrote:
> Hi Ben,
>
> We actually got to the bottom of all of them except for one. The
> content truncation was due to an inconsistency bug in the Nutch config.
> The "no permission to extract text" error is actually correct: the author, the NC
> Department of Revenue, put this restriction on all of their files (I have
> asked them to remove it, as it hampers public accessibility).  The null
> pointer exception is the only one left to deal with that may be due to the
> parsing bug.  Is this the one that you are referring to?
>
> -----Original Message-----
> From: Ben Litchfield [mailto:ben@csh.rit.edu]
> Sent: Thursday, March 02, 2006 4:07 PM
> To: Richard Braman
> Cc: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org;
> pdfbox-user@lists.sourceforge.net
> Subject: Re: [PDFBox-user] PDF Parse Error
>
>
>
> I believe these errors are due to a parsing bug in PDFBox that has
> been fixed since the 0.7.2 release.  Please give the nightly
> build (should be a drop-in replacement) a try from
> http://www.pdfbox.org/dist and let me know if you are still having
> issues.
>
> Ben

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

Posted by "Ben Litchfield (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372277 ] 

Ben Litchfield commented on NUTCH-220:
--------------------------------------

Actually, now that I look at the stack trace, the NPE is not happening in PDFBox code; it appears to be in Hadoop code, so I don't think that upgrading PDFBox will help.

Ben


[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

Posted by "Richard Braman (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372310 ] 

Richard Braman commented on NUTCH-220:
--------------------------------------

I upgraded the Nutch 0.8 trunk to PDFBox HEAD.

The NullPointerException seems to be resolved by upgrading Nutch to PDFBox 0.7.3.

The major issues in upgrading Nutch to PDFBox 0.7.3 are:

1.  PDFBox now depends on FontBox, which must be included as a plugin, lib-fontbox.
2.  PDFBox no longer depends on log4j. When I tried to remove references to that dependency in the build.xml for parse-pdf, I got assorted Ant build errors; when I left the references to log4j in place, it built fine.

Someone who has more knowledge of building Nutch needs to modify the build.xml and plugin.xml if references to log4j should be removed.

plugin.xml for FontBox
<plugin
   id="lib-fontbox"
   name="FontBox"
   version="0.1.0-dev"
   provider-name="org.fontbox">
   <runtime>
     <library name="FontBox-0.1.0-dev.jar">
        <export name="*"/>
     </library>
   </runtime>
</plugin>

build.xml for lib-fontbox
<project name="lib-fontbox" default="jar">
  <import file="../build-plugin.xml"/>
  <!--
   ! Override the compile and jar targets,
   ! since there is nothing to compile here.
   ! -->
  <target name="compile" depends="init"/>
  <target name="jar" depends="compile">
    <copy todir="${build.dir}" verbose="true">
      <fileset dir="./lib" includes="**/*.jar"/>
    </copy>
  </target>
</project>

parse-pdf plugin.xml
<plugin
   id="parse-pdf"
   name="Pdf Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">
   <runtime>
      <library name="parse-pdf.jar">
         <export name="*"/>
      </library>
      <library name="PDFBox-0.7.3.jar"/>
      <library name="log4j-1.2.9.jar"/>
      <library name="FontBox-0.1.0-dev.jar"/>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-log4j"/>
      <import plugin="lib-fontbox"/>
   </requires>
   <extension id="org.apache.nutch.parse.pdf"
              name="PdfParse"
              point="org.apache.nutch.parse.Parser">
      <implementation id="org.apache.nutch.parse.pdf.PdfParser"
                      class="org.apache.nutch.parse.pdf.PdfParser"
                      contentType="application/pdf"
                      pathSuffix=""/>

   </extension>
</plugin>

parse-pdf build.xml
<project name="parse-pdf" default="jar-core">
  <import file="../build-plugin.xml"/>
  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-log4j"/>
    <ant target="jar" inheritall="false" dir="../lib-fontbox"/> 
  </target>
  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-log4j/*.jar" />
      <include name="**/lib-fontbox/*.jar" />  
    </fileset>
  </path>
  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-log4j"/>
    <ant target="deploy" inheritall="false" dir="../lib-fontbox"/>  
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
    <ant target="deploy" inheritall="false" dir="../protocol-file"/>
  </target>
  <!-- for junit test -->
  <mkdir dir="${build.test}/data"/>
  <copy file="sample/pdftest.pdf" todir="${build.test}/data"/>
</project>


[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

Posted by "Richard Braman (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372275 ] 

Richard Braman commented on NUTCH-220:
--------------------------------------

PDFBox-0.7.3 no longer depends on log4j at all, so you should not be
getting any log4j errors from PDFBox!

Ben


On Sun, 26 Mar 2006, Richard Braman wrote:

> > Hi Ben,
> > I noticed that Nutch uses a log4j version of PDFBox.jar.  I don't
> > see this as an Ant target in 0.7.3.  I downloaded PDFBox from CVS HEAD.
> >
> > When I tried to use the PDFBox nightly it gave me a bunch of log4j
> > errors, so I guess Nutch expects the log4j version.
> >
> > I am trying to upgrade my Nutch to PDFBox 0.7.3 to see if I can get rid of the
> > NPE error.
> >
> > The NPE bug I told you about a few weeks ago has a much worse effect in
> > Nutch 0.8, as it seems to cause the fetcher to abort.
> >
> > 060326 142450 fetch of http://www.state.sd.us/drr2/reg/bank/Trust%20Fee%20Calculation.pdf failed with: java.lang.NullPointerException
> > java.lang.NullPointerException
> >     at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:180)
> >     at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:171)
> >     at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
> >     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:245)
> >     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:185)
> > 060326 142450 SEVERE fetcher caught:java.lang.NullPointerException
> >
> > --
> > Richard L Braman, Jr., CPA
> > Tax Code Software Foundation, Inc.
> > Open Source Tax Software
> > http://www.taxcodesoftware.org
> > rbraman@taxcodesoftware.org
> >



[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12371669 ] 

Chris A. Mattmann commented on NUTCH-220:
-----------------------------------------

Could you provide some more detail on this issue? For instance, a stack trace here would be quite helpful in trying to debug this problem. 


[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

Posted by "Richard Braman (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12371887 ] 

Richard Braman commented on NUTCH-220:
--------------------------------------

Here is an example of the error from my log file.  It seems it was fixed in the latest PDFBox, per Ben Litchfield, the developer of PDFBox.


060325 212856 fetch of http://www.state.sd.us/drr2/reg/bank/Trust%20Fee%20Calculation.pdf failed with: java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:180)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:171)
        at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:245)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:185)
060325 212856 SEVERE fetcher caught:java.lang.NullPointerException
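
When deciding which project a trace like this should be reported against, it can help to look at the class in the topmost frame. The following is only a minimal sketch of that check, not part of Nutch, Hadoop, or PDFBox: the class name TopFrame and the method topFrameClass are hypothetical, and the regular expression assumes the usual "at package.Class.method(File.java:NN)" frame layout.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from Nutch, Hadoop, or PDFBox): finds the class
// named in the first "at ..." frame of a Java stack trace.
public class TopFrame {
    // Assumed frame layout: "at fully.qualified.Class.method(File.java:NN)".
    private static final Pattern FRAME =
        Pattern.compile("^\\s*at\\s+([\\w.$]+)\\.[\\w$<>]+\\(");

    /** Returns the fully qualified class of the first frame, or null if no frame is found. */
    public static String topFrameClass(String trace) {
        for (String line : trace.split("\\R")) {
            Matcher m = FRAME.matcher(line);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Frames copied from the log excerpt above.
        String trace = String.join("\n",
            "java.lang.NullPointerException",
            "    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:180)",
            "    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:171)",
            "    at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)",
            "    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:245)",
            "    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:185)");
        String cls = topFrameClass(trace);
        System.out.println("top frame class: " + cls);
        System.out.println("in org.apache.hadoop: "
            + (cls != null && cls.startsWith("org.apache.hadoop.")));
    }
}
```

The top frame only shows where the exception was thrown; the null value that triggered it may well have been produced earlier by a caller, so this is a starting point for triage rather than a diagnosis.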
