Posted to user@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/03 01:11:24 UTC

RE: [PDFBox-user] PDF Parse Error

Hi Ben,

We actually got to the bottom of all of them except for one. The content
truncation was due to an inconsistency bug in the Nutch config.
The "no permission to extract text" error is actually accurate: the
author, the NC Department of Revenue, put this restriction on all of
their files (I have asked them to remove it, as it hampers public
accessibility). The null pointer exception is the only one left to deal
with, and it may be due to the parsing bug. Is this the one that you are
referring to?

-----Original Message-----
From: Ben Litchfield [mailto:ben@csh.rit.edu] 
Sent: Thursday, March 02, 2006 4:07 PM
To: Richard Braman
Cc: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org;
pdfbox-user@lists.sourceforge.net
Subject: Re: [PDFBox-user] PDF Parse Error

I believe these errors are due to a parsing bug in PDFBox that has been
fixed since the 0.7.2 release.  Please give the nightly build (should be
a drop-in replacement) a try from http://www.pdfbox.org/dist and let me
know if you are still having issues.

Ben

On Tue, 28 Feb 2006, Richard Braman wrote:

> I get the following errors regarding pdf:
>
> 060228 160518 fetch okay, but can't parse
> http://taxpros.marylandtaxes.com/publications/revenews/archives/spr05_hi.pdf,
> reason: failed(2,202): Content truncated at 66005 bytes. Parser
> can't handle incomplete pdf file.
>
> 060228 160354 fetch okay, but can't parse 
> http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
> failed(2,0): Can't be handled as pdf document. 
> java.lang.NullPointerException
>
> 060228 160518 fetch okay, but can't parse
> http://www.dor.state.nc.us/downloads/corp_archive/03archive/NC478_Instructions.pdf,
> reason: failed(2,0): Can't be handled as pdf document.
> java.io.IOException: You do not have permission to extract text
>
> I have a number of errors like this in my log, mostly the content 
> truncated one.
>
> The thing is, these files all open fine in Acrobat.
>
>
>
> Richard Braman
> mailto:rbraman@taxcodesoftware.org
> 561.748.4002 (voice)
>
> http://www.taxcodesoftware.org <http://www.taxcodesoftware.org/> Free 
> Open Source Tax Software
>
>
>

Re: [PDFBox-user] PDF Parse Error

Posted by Ben Litchfield <be...@benlitchfield.com>.

Yes, the NPE should be fixed.

 Ben

Richard Braman wrote:
> Hi Ben,
>
> We actually got to the bottom of all of them except for one. The content
> truncation was due to an inconsistency bug in the Nutch config.
> The "no permission to extract text" error is actually accurate: the
> author, the NC Department of Revenue, put this restriction on all of
> their files (I have asked them to remove it, as it hampers public
> accessibility). The null pointer exception is the only one left to deal
> with, and it may be due to the parsing bug. Is this the one that you are
> referring to?
>
> -----Original Message-----
> From: Ben Litchfield [mailto:ben@csh.rit.edu] 
> Sent: Thursday, March 02, 2006 4:07 PM
> To: Richard Braman
> Cc: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org;
> pdfbox-user@lists.sourceforge.net
> Subject: Re: [PDFBox-user] PDF Parse Error
>
>
>
> I believe these errors are due to a parsing bug in PDFBox that has been
> fixed since the 0.7.2 release.  Please give the nightly build (should be
> a drop-in replacement) a try from http://www.pdfbox.org/dist and let me
> know if you are still having issues.
>
> Ben
>
>
>
> On Tue, 28 Feb 2006, Richard Braman wrote:
>
>   
>> I get the following errors regarding pdf:
>>
>> 060228 160518 fetch okay, but can't parse
>> http://taxpros.marylandtaxes.com/publications/revenews/archives/spr05_hi.pdf,
>> reason: failed(2,202): Content truncated at 66005 bytes. Parser
>> can't handle incomplete pdf file.
>>
>> 060228 160354 fetch okay, but can't parse 
>> http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
>> failed(2,0): Can't be handled as pdf document. 
>> java.lang.NullPointerException
>>
>> 060228 160518 fetch okay, but can't parse
>> http://www.dor.state.nc.us/downloads/corp_archive/03archive/NC478_Instructions.pdf,
>> reason: failed(2,0): Can't be handled as pdf document.
>> java.io.IOException: You do not have permission to extract text
>>
>> I have a number of errors like this in my log, mostly the content 
>> truncated one.
>>
>> The thing is, these files all open fine in Acrobat.
>>
>>
>>
>> Richard Braman
>> mailto:rbraman@taxcodesoftware.org
>> 561.748.4002 (voice)
>>
>> http://www.taxcodesoftware.org <http://www.taxcodesoftware.org/> Free 
>> Open Source Tax Software
>>
>>
>>
>>     
>
>
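
The thread separates three distinct causes behind the "can't parse"
lines: truncated content, a PDF permission restriction, and a
NullPointerException. As a rough illustration of sorting a log along
those lines, here is a minimal Python sketch. It is not part of Nutch or
PDFBox; the names LINE, classify and tally are hypothetical, the pattern
is inferred only from the three log lines quoted in this thread, and it
assumes each record sits on a single line in the real log (the mail
client wrapped the URLs above).

```python
import re
from collections import Counter

# Sample records copied from the log excerpt quoted in this thread,
# with the mail-wrapped URLs rejoined onto one line each.
LOG = """\
060228 160518 fetch okay, but can't parse http://taxpros.marylandtaxes.com/publications/revenews/archives/spr05_hi.pdf, reason: failed(2,202): Content truncated at 66005 bytes. Parser can't handle incomplete pdf file.
060228 160354 fetch okay, but can't parse http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason: failed(2,0): Can't be handled as pdf document. java.lang.NullPointerException
060228 160518 fetch okay, but can't parse http://www.dor.state.nc.us/downloads/corp_archive/03archive/NC478_Instructions.pdf, reason: failed(2,0): Can't be handled as pdf document. java.io.IOException: You do not have permission to extract text
"""

# Hypothetical pattern: assumed from the three sample lines only.
LINE = re.compile(r"can't parse (?P<url>\S+), reason: (?P<reason>.*)$")


def classify(reason):
    """Map a failure reason string to one of the causes discussed above."""
    if "Content truncated" in reason:
        return "truncated"        # fetch cut the file short (config side)
    if "do not have permission" in reason:
        return "no-permission"    # restriction set by the PDF's author
    if "NullPointerException" in reason:
        return "npe"              # candidate for the parser bug
    return "other"


def tally(log_text):
    """Count parse failures per cause; lines that do not match are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE.search(line)
        if match:
            counts[classify(match.group("reason"))] += 1
    return counts


if __name__ == "__main__":
    for kind, count in sorted(tally(LOG).items()):
        print(kind, count)
```

Grouping the failures this way makes it easier to see which ones a
newer PDFBox build could plausibly fix (the NPE) and which ones it
cannot (truncated fetches and permission restrictions).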
