Posted to dev@nutch.apache.org by Earl Cahill <ca...@yahoo.com> on 2005/10/13 01:21:53 UTC

clustering strategies

I think it would be nice to have a few cluster
strategies on the wiki.

It seems there are at least three separate needs: CPU,
storage and bandwidth, and I think the more those
could be cleanly spread to different boxes, the
better.  

Guess I am imagining a breakdown that lists, by
priority, how things should be broken out.  So someone
could look at the list and say, ok, I have three good
boxes, I should make the best box do x, the second
best do y, etc.  There could also be case studies for
how different folks did their own implementations and
what their crawl/query times were like.

I have a small cluster (up to 15 boxes) and would like
to start to play around and see how things go under
different strategies.  I also have about a million
pages of local content, so I can hammer things pretty
hard without even leaving my network.  I know that may
not match normal conditions, but it could hopefully
remove a variable or two (network latency, slow
sites), to keep things simple at least to start.

I think it is also a decent goal to be able to
crawl/index my pages in a night (say eight hours),
which would be around 35 pages/second.  If that isn't
a reasonable goal, I would like to hear why not.
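
As a quick check of the figure above, here is a minimal sketch of the arithmetic. The even split across 15 boxes is only an illustrative assumption (the message says "up to 15 boxes" but does not say how many would fetch), and the function names are made up for this example.

```python
# Back-of-the-envelope fetch rate for the goal stated above.
PAGES = 1_000_000   # "about a million pages"
HOURS = 8           # "a night (say eight hours)"
BOXES = 15          # assumption: every box fetches and load is spread evenly


def required_rate(pages: int, hours: float) -> float:
    """Pages per second needed to fetch `pages` within `hours`."""
    return pages / (hours * 3600)


def per_box_rate(pages: int, hours: float, boxes: int) -> float:
    """Pages per second each box must sustain under an even split."""
    return required_rate(pages, hours) / boxes


print(f"{required_rate(PAGES, HOURS):.1f} pages/second overall")
print(f"{per_box_rate(PAGES, HOURS, BOXES):.1f} pages/second per box")
```

The overall rate comes out just under 35 pages/second, which matches the estimate in the message.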

For each strategy, we could have a set of confs
describing how to set things up.  I can picture a GUI
which could list box roles (crawler, mapper, whatever)
and boxes available.  The users could drag and drop
their boxes to roles, and confs could then be
generated.  I think it could make for rather easy
design/implementation of clusters that could get
rather complicated.  I can do drag/drop and
interpolate into templates in JavaScript, so I could
envision a rather simple page.

Maybe we could even store the cluster setup in XML,
and have a script that takes the XML and draws the
cluster.  Then when people report slowness or the
like, they could also post their cluster setup.
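
One way the XML idea might look, as a rough sketch only. The `<cluster>` and `<box>` elements and the `name` and `role` attributes are hypothetical and invented for this example; they are not an existing Nutch format. The "drawing" here is just a text listing grouped by role.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical cluster description; element and attribute names are
# made up for this sketch and are not a Nutch format.
CLUSTER_XML = """
<cluster name="example">
  <box name="box1" role="crawler"/>
  <box name="box2" role="crawler"/>
  <box name="box3" role="mapper"/>
</cluster>
"""


def boxes_by_role(xml_text):
    """Parse the description and group box names by their role."""
    root = ET.fromstring(xml_text.strip())
    roles = defaultdict(list)
    for box in root.iter("box"):
        roles[box.get("role", "unassigned")].append(box.get("name", "?"))
    return dict(roles)


def draw(xml_text):
    """Render a plain-text picture of the cluster, one line per role."""
    grouped = boxes_by_role(xml_text)
    return "\n".join(
        f"{role}: {', '.join(names)}" for role, names in sorted(grouped.items())
    )


print(draw(CLUSTER_XML))
```

A description like this would be short enough to paste into a mailing-list post alongside a slowness report.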

I think when users come to Nutch, they come with a set
of boxes.  I think it would be nice for them to see
what has worked for such a set of boxes in the past
and be able to easily implement such a strategy.  Kind
of a one-hour-from-download-to-spidering vision.

Just a few thoughts.

Earl


	
		

Re: clustering strategies

Posted by Kelvin Tan <ke...@relevanz.com>.
Earl, for a start: since you're crawling your local network and hammering it is not a problem, have you also tried disabling things like robots checking and the server wait delay?

On Fri, 14 Oct 2005 17:14:23 -0700 (PDT), Earl Cahill wrote:
> Well, strangely, there does not seem to be a lot of interest here.
>
> My main concern is that I am trying to crawl content a hop away,
> and I can't really do it very fast.  Once I start my mapred crawl,
> it spends most of the time map reducing and very little time
> actually getting pages. I made the change Doug suggested
> (fetcher.threads.per.host=100, http.max.delays=0), and the crawl
> still goes very slowly.  I have several other boxes I can use (two
> "good" boxes, several other boxes), just not sure how best to
> spread the jobs, storage and the like.
>
> Again, I would like to crawl about a million local pages in a night.
>
> Feedback would be appreciated.
>
> Thanks,
> Earl
>



Re: clustering strategies

Posted by Earl Cahill <ca...@yahoo.com>.
Well, strangely, there does not seem to be a lot of interest here.

My main concern is that I am trying to crawl content a
hop away, and I can't really do it very fast.  Once I
start my mapred crawl, it spends most of the time map
reducing and very little time actually getting pages. 
I made the change Doug suggested
(fetcher.threads.per.host=100, http.max.delays=0), and
the crawl still goes very slowly.  I have several other
boxes I can use (two "good" boxes, several other
boxes), just not sure how best to spread the jobs,
storage and the like.  
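
For anyone wanting to reproduce the two settings quoted above, here is a small sketch that renders them as conf entries. It assumes the usual name/value `<property>` layout of Nutch conf files. Which file they belong in and what root element wraps them depends on the Nutch version, so only the property elements are generated. `render_property` is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

# The two values quoted in the message above.
SETTINGS = {
    "fetcher.threads.per.host": 100,
    "http.max.delays": 0,
}


def render_property(name, value):
    """Return one <property> element (name plus value) as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(prop, encoding="unicode")


for name, value in SETTINGS.items():
    print(render_property(name, value))
```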

Again, I would like to crawl about a million local
pages in a night.

Feedback would be appreciated.

Thanks,
Earl
