Posted to dev@nutch.apache.org by Ledio Ago <la...@looksmart.net> on 2005/12/20 00:24:31 UTC

RE: [Nutch-dev] distributed search

I tried moving Tomcat to a separate machine, and bingo:
performance went up by 30%.  Right now I have only two machines,
with 900K URLs each, that act as Nutch search servers, and one machine that hosts Tomcat.

At this point I no longer suspect that Tomcat is requesting results from each
server synchronously.  I haven't found any documentation that confirms this,
but based on the latest numbers I really doubt that is what's happening.

Next thing to do is to use 4 machines and see what the performance is.  I'll
try to split the index into 4 pieces now.

By the way, is there an easy way to split the index I already have?
I would hate to recrawl all 1.9MM URLs and waste bandwidth.

Thanks,
Ledio

-----Original Message-----
From: Ledio Ago 
Sent: Friday, December 16, 2005 8:21 AM
To: nutch-dev@lucene.apache.org; dev@nutch.org
Cc: nutch-developers@lists.sourceforge.net
Subject: RE: [Nutch-dev] distributed seach


Thank you Stefan for the reply.

I did have separate physical indexes on separate machines, with about 900K URLs in
each of them.  I ran Tomcat on one of those boxes and tested the load.  I got
the same numbers as when I didn't use distributed search.
So I suspected that Tomcat wasn't making asynchronous calls to the Nutch
servers, hence the performance issue.

I'll try versions 0.7 and 0.8 and see what happens.  Another thing I'll try
is to put Tomcat on a different machine.

Thanks,
Ledio


-----Original Message-----
From: Stefan Groschupf [mailto:sg@media-style.com]
Sent: Fri 16-Dec-05 3:13 AM
To: dev@nutch.org
Cc: nutch-developers@lists.sourceforge.net
Subject: Re: [Nutch-dev] distributed seach
 
Hi Ledio,
the current Nutch release is 0.7, or you can also use the 0.8 branch code.
Also, you are using the old mailing lists; I suggest you use the Apache
Nutch user mailing list:
http://lucene.apache.org/nutch/mailing_lists.html
To answer your question: Nutch forwards the query to all search
servers, then collects and reranks their results.
So give each of your servers a physically separate part of your index.
This will improve your performance. Also check that the index parts
are not stored on the same hard disk and that your search servers have as much
RAM as possible.
HTH
Stefan
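
The forward, collect and rerank behaviour described above can be sketched
in plain Java. This is only an illustration of a parallel fan-out and is
not Nutch's actual code: the class FanOutSearch, the Hit type and the
Searcher interface are hypothetical names, and each Searcher stands in for
one remote search server holding one part of the index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical types for illustration only; these are not Nutch classes.
final class FanOutSearch {

    /** One scored result returned by one search server. */
    static final class Hit {
        final String url;
        final double score;
        Hit(String url, double score) { this.url = url; this.score = score; }
    }

    /** Stand-in for a remote search server that holds one index part. */
    interface Searcher {
        List<Hit> search(String query, int topN) throws Exception;
    }

    /** Queries all searchers in parallel, then merges and reranks by score. */
    static List<Hit> search(List<Searcher> searchers, String query, int topN)
            throws InterruptedException, ExecutionException {
        if (searchers.isEmpty()) {
            return new ArrayList<>();
        }
        // One thread per searcher, so no server waits for another to finish.
        ExecutorService pool = Executors.newFixedThreadPool(searchers.size());
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (Searcher s : searchers) {
                tasks.add(() -> s.search(query, topN));
            }
            // invokeAll runs the tasks concurrently and returns once all are done.
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : pool.invokeAll(tasks)) {
                merged.addAll(f.get());
            }
            // Rerank the combined hits, highest score first, and keep the top N.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } finally {
            pool.shutdown();
        }
    }
}
```

With a fan-out like this, the response time is roughly that of the slowest
server rather than the sum of all servers, which is the difference between
the asynchronous and synchronous behaviour discussed in this thread.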




On 16.12.2005 at 03:00, Ledio Ago wrote:

> I was able to set up Nutch searchers in a distributed fashion by
> creating the search-server.txt files
> at the root of the data where Tomcat was running.  I had a total of
> 1.9 MM URLs split in half, one half for
> each searcher.
> I was very surprised to see that the performance numbers I got for
> this setup were not as good as
> I was expecting.  Before I ran this setup, I ran the test on a
> single searcher with 1.9 MM URLs.
> The results for the distributed setup were the same or even worse.
>
> One thing that I suspect is that Tomcat is querying each Nutch
> search server synchronously, one at a time,
> instead of asynchronously, because that would explain a lot.
>
> Can somebody tell me if this is true??
>
> I'm running Nutch 0.5 on very beefy machines.
>
> Thanks,
>
> Ledio



Re: [Nutch-dev] distributed search

Posted by Rafi Iz <ra...@hotmail.com>.
Check the following command:
FetchListTool (-local | -ndfs <namenode:port>) <db>  <segment_dir> 
[-refetchonly] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] 
[-adddays numDays]

This command calls a function named emitMultipleLists, which emits
several fetchlists so that you can fetch across several machines.

e.g.
bin/nutch org.apache.nutch.tools.FetchListTool ......

Rafi


>From: Stefan Groschupf <sg...@media-style.com>
>Reply-To: nutch-dev@lucene.apache.org
>To: nutch-dev@lucene.apache.org
>Subject: Re: [Nutch-dev] distributed search
>Date: Tue, 20 Dec 2005 00:38:22 +0100
>
>>By the way, is there an easy way to split the index I already have?
>>I would hate to recrawl all 1.9MM URLs and waste bandwidth.
>
>Well, I do not know of any tool, shipped with Nutch or otherwise, that
>does this; maybe there is one.
>But writing a Java class that creates two smaller indexes from one large
>one is very easy, an hour's work at most.
>Just check any of the existing Lucene tutorials, the Lucene Javadoc, or the
>Lucene book.
>BTW, Erik Hatcher's book "Lucene in Action" is a MUST for all Nutch users.
>:-)
>
>Stefan
>



Re: [Nutch-dev] distributed search

Posted by Stefan Groschupf <sg...@media-style.com>.
> By the way, is there an easy way to split the index I already have?
> I would hate to recrawl all 1.9MM URLs and waste bandwidth.

Well, I do not know of any tool, shipped with Nutch or otherwise, that
does this; maybe there is one.
But writing a Java class that creates two smaller indexes from one
large one is very easy, an hour's work at most.
Just check any of the existing Lucene tutorials, the Lucene Javadoc, or
the Lucene book.
BTW, Erik Hatcher's book "Lucene in Action" is a MUST for all Nutch
users. :-)

Stefan
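
As a rough sketch of the partitioning step only, the plain-Java helper
below decides which document goes into which smaller index. The class
IndexSplitPlan and its plan method are hypothetical names and are not part
of Nutch or Lucene. Reading the documents from the large index and writing
them into the new ones (for example with Lucene's IndexReader and
IndexWriter) is deliberately left out, and how to wire that up is an
assumption to check against the Lucene documentation. In particular, a
document read back from an index contains only its stored fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
final class IndexSplitPlan {

    /**
     * Assigns document numbers 0..maxDoc-1 to `parts` buckets in round-robin
     * order, skipping documents that are flagged as deleted. Bucket i lists
     * the document numbers that would be copied into the i-th smaller index.
     */
    static List<List<Integer>> plan(int maxDoc, boolean[] deleted, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        if (deleted.length < maxDoc) {
            throw new IllegalArgumentException("deleted[] must cover every document");
        }
        List<List<Integer>> buckets = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            buckets.add(new ArrayList<>());
        }
        int live = 0; // counts live documents so the buckets stay balanced
        for (int doc = 0; doc < maxDoc; doc++) {
            if (deleted[doc]) {
                continue; // deleted documents are not carried over
            }
            buckets.get(live % parts).add(doc);
            live++;
        }
        return buckets;
    }
}
```

Counting only live documents keeps the parts close to equal in size, so
each search server ends up with a similar share of the work.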


GETTING OUT OF MAILING LIST

Posted by "Rolando H. Martinelli - CoBuys, S.A." <ro...@cobuys.com>.
Hi, 

how can I get out of the mailing list?

Regards,
Rolando