Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2012/02/27 21:06:08 UTC

Large Shared Drive Crawl

I am attempting to crawl a very large intranet file system using Nutch and I
am having some issues.  At one point in the crawl cycle I get a Java heap
space error during fetching.  I think it is related to the number of URLs
listed in the segment to be fetched.  I do want to crawl/index EVERYTHING on
this share drive, but I think the sheer number of folders and files listed
in some directories is hosing things up.

So my question is: will changing topN to a small number allow me to
eventually get all the URLs on this shared drive (after many, many generate
->fetch ->parse ->updatedb ->invertlinks ->solrindex cycles)?  I recently
upgraded to 1.4 and I no longer see the depth parameter. If it is still
around, would that be a possible way to shorten the cycle, which would keep
the memory usage down during each cycle?
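
For reference, one round of the cycle I am running looks roughly like the
sketch below (bash, local mode; the paths, the topN value and the Solr URL
are just placeholders for my setup):

  # one generate -> fetch -> parse -> updatedb -> invertlinks -> solrindex round
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments
  LINKDB=crawl/linkdb

  # limit the batch with -topN, then pick up the newest segment by name
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 1000
  SEGMENT=$SEGMENTS/$(ls $SEGMENTS | sort | tail -1)

  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb $CRAWLDB "$SEGMENT"
  bin/nutch invertlinks $LINKDB -dir $SEGMENTS
  # solrindex args as in the 1.4 tutorial (crawldb, linkdb, then the segment)
  bin/nutch solrindex http://localhost:8983/solr/ $CRAWLDB $LINKDB "$SEGMENT"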

Anything else I am missing? 

Thanks!


Re: Large Shared Drive Crawl

Posted by webdev1977 <we...@gmail.com>.
What is a reasonable number of threads?  What about memory?  Where is the
best place to set those: in the nutch script, or in one of the config files?
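
At the moment I am only guessing at values, e.g. something like this around
each run (NUTCH_HEAPSIZE is the variable the bin/nutch script reads for the
JVM heap in MB, and -threads is the fetcher's own option; the numbers are
just guesses):

  # JVM heap for the local job, in MB (read by bin/nutch); value is a guess
  export NUTCH_HEAPSIZE=4096

  # the thread count can also be given per fetch run instead of editing config
  bin/nutch fetch "$SEGMENT" -threads 10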

I abandoned using distributed mode (10 slaves); it was taking WAYYYYY too
long to crawl the web and share drives in my enterprise, not to mention I am
running entirely on a Windows platform and I think that Hadoop is having
some issues on the namenode (it shuts down after running for a few hours).

I get an OOM error during the fetch cycle:

java.lang.OutOfMemoryError: Java heap space
   at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
   ....

This is after several file 404 errors (some directories and files are locked
down, hence the 404s), as well as several errors like
java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in
escape (%) pattern - For input string: " G".

Re: Large Shared Drive Crawl

Posted by Markus Jelsma <ma...@openindex.io>.
> I guess I don't mind using topN as long as I can be assured that I will get
> ALL of the urls crawled eventually. Do you know if that is a true
> statement?

That is true. The cycle will continue until all records are exhausted; you 
just need more cycles. Also consider using maxSegments to generate a larger 
number of records per cycle. That is much more efficient for generating and 
updating.
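
For example, something like this (the numbers are just an illustration; on
1.4 the generator option should be -maxNumSegments):

  # select up to ~50k records per segment, in up to 4 segments, in a single
  # pass over the crawldb
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 4

  # then fetch/parse/updatedb each of the newly created segments
  # (here simply the four newest by name)
  for SEGMENT in $(ls -d crawl/segments/* | sort | tail -4); do
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb crawl/crawldb "$SEGMENT"
  done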

On Tuesday 28 February 2012 12:22:47 webdev1977 wrote:
> OH.. forgot to say.. no I am not parsing while fetching.  I had more
> problems with that so I turned it off.

If you have heap issues during fetching (with or without parsing), you have too 
little RAM allocated and too many threads. If you have heap issues when the 
fetcher (with or without parsing) is finalizing (the shuffle, sort and reduce 
part after the last record is fetched), then you have bad settings (e.g. 
io.sort.mb and io.sort.factor) and, again, too many records for too little heap 
space allocated.
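
For example, something like the following (the values are only illustrative
and assume local mode, where the whole job runs in the single JVM that
bin/nutch starts; the fetch command goes through ToolRunner so -D properties
should work on the command line, otherwise put the same properties in your
config):

  # more heap for the local job runner; bin/nutch reads NUTCH_HEAPSIZE (MB)
  export NUTCH_HEAPSIZE=4096

  # shuffle/sort settings for the fetch job; tune io.sort.mb to the heap
  bin/nutch fetch -D io.sort.mb=128 -D io.sort.factor=25 "$SEGMENT" -threads 10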

When using a parsing fetcher you must significantly reduce the number of 
fetcher threads!!


-- 
Markus Jelsma - CTO - Openindex

Re: Large Shared Drive Crawl

Posted by webdev1977 <we...@gmail.com>.
OH.. forgot to say.. no I am not parsing while fetching.  I had more problems
with that so I turned it off.


Re: Large Shared Drive Crawl

Posted by webdev1977 <we...@gmail.com>.
Thanks for the reply!

I guess I don't mind using topN as long as I can be assured that I will get
ALL of the urls crawled eventually. Do you know if that is a true statement?


Re: Large Shared Drive Crawl

Posted by Ferdy Galema <fe...@kalooga.com>.
I am not sure about the depth parameter, but yes, using a smaller topN
should reduce the chances of heap errors. However, even small batches can
be problematic, for example when the fetched URLs are expensive to parse.
(Are you parsing during fetch?) I recommend looking at the logs to try to
pinpoint the exact cause.
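
For example, fetching without parsing and then parsing as a separate step
would look roughly like this (the -noParsing flag and the logs/hadoop.log
location assume a default 1.4 local-mode setup):

  # fetch without parsing, then parse the same segment as a separate job
  bin/nutch fetch "$SEGMENT" -noParsing
  bin/nutch parse "$SEGMENT"

  # the local runtime logs to logs/hadoop.log; the lines just before an
  # OutOfMemoryError usually show which phase or document triggered it
  grep -i -B 5 -A 10 "OutOfMemoryError" logs/hadoop.log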

