Posted to user@nutch.apache.org by ahmad ajiloo <ah...@gmail.com> on 2011/10/04 19:59:10 UTC

error on fetching pdf and doc files

Hi,
I want to crawl with this seed:
http://shce.sums.ac.ir/articles/farsi.html

When the crawl reaches the PDF and DOC files, parsing fails with errors
like these:
--------------------------------------------------------------------------------
ParseSegment: starting at 2011-10-04 21:08:05
ParseSegment: segment: crawl-2/segments/20111004210620
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/department/technical/pdf/brca.doc:
failed(2,0): Your file contains 124 sectors, but the initial DIFAT array at
index 0 referenced block # 151. This isn't allowed and  your file is corrupt
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/asthmprevention.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/bicycle_safety.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/ca2.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/ca3.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/ca4.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/ca5.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/cancerrisks.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/cellphonehazard.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/chol.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/coronarydisprevention.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/diabetescontrol.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/diabeteshandouts.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/farsidiabetes.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/quitsession.pdf:
failed(2,0): null
Error parsing:
http://shce.sums.ac.ir/icarusplus/export/sites/shce/download/thalassemia3.pdf:
failed(2,0): null
ParseSegment: finished at 2011-10-04 21:08:07, elapsed: 00:00:02
-------------------------------------------------------------------
Can anyone help me?

Re: error on fetching pdf and doc files

Posted by ahmad ajiloo <ah...@gmail.com>.
Thanks. It was about 65 KB!

On Tue, Oct 4, 2011 at 9:34 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Check your http.content.limit; I can parse at least one of your files
> correctly.

Re: error on fetching pdf and doc files

Posted by Markus Jelsma <ma...@openindex.io>.
Check your http.content.limit; I can parse at least one of your files
correctly.
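
For readers following along, here is a minimal sketch of how that setting is usually overridden, assuming the standard conf/nutch-site.xml override file. The value -1 (no truncation) is an illustrative choice and does not come from this thread; a larger positive byte count should work as well. As far as I know, the default shipped in nutch-default.xml at the time was 65536 bytes, which matches the "about 65 KB" reported above.

```xml
<!-- conf/nutch-site.xml: overrides values from nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Limit in bytes for content fetched over http://.
         A non-negative value truncates longer content;
         a negative value disables truncation.
         -1 is an assumed example value, not taken from the thread. -->
    <value>-1</value>
  </property>
</configuration>
```

Truncation happens at fetch time, so segments fetched under the old limit still contain cut-off files and would likely need to be fetched again before parsing succeeds.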
