Posted to user@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/01 00:12:11 UTC

RE: PDF Parse Error

I should have seen that in the Wiki FAQ.  Thanks, Jerome.  Will these pages
get refetched next time, or will they wait 30 days?

-----Original Message-----
From: Jérôme Charron [mailto:jerome.charron@gmail.com] 
Sent: Tuesday, February 28, 2006 4:19 PM
To: nutch-user@lucene.apache.org; rbraman@bramantax.com
Subject: Re: PDF Parse Error


Edit your nutch-site.xml (or nutch-default.xml) and change the
http.content.limit property (set it to 0 if you don't want any content
truncation at all).
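
For illustration, a minimal sketch of such an override, assuming the usual
name/value property layout of the Nutch configuration files. The value 0 is
the one suggested above. The property goes inside the existing root element
of nutch-site.xml, and the description of http.content.limit in your own
nutch-default.xml is the authority on which values disable truncation in
your version.

```xml
<!-- Sketch only: place inside the root element of nutch-site.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- 0 as suggested above; check nutch-default.xml for your version. -->
  <value>0</value>
</property>
```

Putting the override in nutch-site.xml rather than editing nutch-default.xml
keeps local changes separate from the shipped defaults.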

Jérôme

On 2/28/06, Richard Braman <rb...@bramantax.com> wrote:
>
> I get the following errors regarding pdf:
>
> 060228 160518 fetch okay, but can't parse
> http://taxpros.marylandtaxes.com/publications/revenews/archives/spr05_hi.pdf,
> reason: failed(2,202): Content truncated at 66005 bytes. Parser
> can't handle incomplete pdf file.
>
> 060228 160354 fetch okay, but can't parse 
> http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
> failed(2,0): Can't be handled as pdf document. 
> java.lang.NullPointerException
>
> 060228 160518 fetch okay, but can't parse
> http://www.dor.state.nc.us/downloads/corp_archive/03archive/NC478_Instructions.pdf,
> reason: failed(2,0): Can't be handled as pdf document.
> java.io.IOException: You do not have permission to extract text
>
> I have a number of errors like this in my log, mostly the content 
> truncated one.
>
> The thing is these files all open fine in acrobat.
>
>
>
> Richard Braman
> mailto:rbraman@taxcodesoftware.org
> 561.748.4002 (voice)
>
> http://www.taxcodesoftware.org
> Free Open Source Tax Software
>
>
>
>


--
http://motrech.free.fr/
http://www.frutch.org/