Posted to dev@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/01 00:12:43 UTC

RE: PDF Parse Error

I set it to 0, since there are some big PDFs on the sites I am crawling.
Thanks Jeff.

-----Original Message-----
From: Jeff Ritchie [mailto:jritchie@netwurklabs.com] 
Sent: Tuesday, February 28, 2006 4:37 PM
To: nutch-dev@lucene.apache.org
Subject: Re: PDF Parse Error


In nutch-site.xml, set it to something like:

<property>
<name>http.content.limit</name>
<value>655360</value>
</property>

Jeff.
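
For reference, a minimal sketch of the same override with truncation switched off instead of raised. It assumes the property description in nutch-default.xml, which as far as I know says that a nonnegative value truncates content at that many bytes and a negative value disables truncation. Check the description shipped with your Nutch version before relying on it; the value -1 here is an assumption and does not come from this thread.

```xml
<!-- nutch-site.xml fragment: illustrative override, not taken from the thread -->
<property>
  <name>http.content.limit</name>
  <!-- Assumption: a negative value means "no truncation" per nutch-default.xml -->
  <value>-1</value>
</property>
```

With no limit, large PDFs are fetched whole, so expect higher memory and bandwidth use during the crawl.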


Richard Braman wrote:

>I get the following errors regarding pdf:
> 
>060228 160518 fetch okay, but can't parse 
>http://taxpros.marylandtaxes.com/publications/revenews/archives/spr05_hi.pdf,
>reason: failed(2,202): Content truncated at 66005 bytes. Parser
>can't handle incomplete pdf file.
> 
>060228 160354 fetch okay, but can't parse 
>http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
>failed(2,0): Can't be handled as pdf document. 
>java.lang.NullPointerException
> 
>060228 160518 fetch okay, but can't parse 
>http://www.dor.state.nc.us/downloads/corp_archive/03archive/NC478_Instructions.pdf,
>reason: failed(2,0): Can't be handled as pdf document.
>java.io.IOException: You do not have permission to extract text
> 
>I have a number of errors like this in my log, mostly the content 
>truncated one.
> 
>The thing is these files all open fine in acrobat.
> 
> 
>
>Richard Braman
>mailto:rbraman@taxcodesoftware.org
>561.748.4002 (voice)
>
>http://www.taxcodesoftware.org
>Free Open Source Tax Software