Posted to user@nutch.apache.org by Shawn Gervais <pr...@project10.net> on 2006/04/15 08:57:37 UTC
Using Nutch's distributed search server mode
Greetings list,
I am attempting to use Nutch's distributed search server mode and seeing
some unexpected results. Searches take ages to execute -- I seem to have
caused Nutch to perform the same search 16 times (I have 16 nodes).
Over the past week, I have been building my indexes:
$ bin/hadoop dfs -ls segments/
/user/nutch/segments/20060406061358 <dir> (~100k pages, first run)
/user/nutch/segments/20060411165547 <dir> (1M pages)
/user/nutch/segments/20060412214204 <dir> (2M pages)
/user/nutch/segments/20060413004057 <dir> (5M pages)
I then indexed them all into a single index. Here is my current DFS "du"
listing:
/user/nutch/crawldb 6379003931
/user/nutch/indexes 12895240115 (index built from all above segs.)
/user/nutch/indexes_old 400611137 (index built from smallest seg.)
/user/nutch/linkdb 107951330
/user/nutch/segments 67746176573
I built my big index by running a command similar to:
$ bin/nutch index indexes crawldb linkdb segments/*
However, as the DFS shell doesn't seem to support wildcards (or my own
shell was expanding them before Hadoop saw them), I was forced to
specify each segment manually.
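For anyone hitting the same wildcard problem, one workaround is to pull the segment paths out of the `-ls` listing and pass them explicitly. This is only a sketch; the exact format of the `-ls` output is assumed from the listing above:

```shell
# Simulate the `bin/hadoop dfs -ls segments/` output from the listing
# above, extract the first column (the path), and join with spaces.
LISTING='/user/nutch/segments/20060406061358 <dir>
/user/nutch/segments/20060411165547 <dir>
/user/nutch/segments/20060412214204 <dir>
/user/nutch/segments/20060413004057 <dir>'
SEGS=$(printf '%s\n' "$LISTING" | awk '{print $1}' | tr '\n' ' ')
# Print the index command rather than run it, for illustration.
echo "bin/nutch index indexes crawldb linkdb $SEGS"
```

In practice you would feed the real `bin/hadoop dfs -ls segments/` output into the same pipeline instead of the pasted listing.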
After building my index I proceeded to set up my distributed search
servers per Stefan's excellent wiki, at
http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever.
I was not able to follow the instructions literally, as my indexes and
segments are in DFS while the document presumes a local-filesystem
installation, and I was also not able to "partition" my indexes or
segments by host, since I don't know how to do that.
When I examine Tomcat's catalina.out log, as well as the logs of the
distributed search servers themselves, I see some odd behavior:
060415 011943 29 query request from 10.10.0.6
060415 011943 29 query: baby
060415 011943 29 searching for 20 raw hits
060415 011950 29 re-searching for 40 raw hits, query: baby
-site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk"
060415 011958 29 found 2741775 raw hits
060415 011958 29 re-searching for 80 raw hits, query: baby
-site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk"
-site:"ffcembroidery.com" -site:"infobluebook.com"
-site:"aaliyahlova.suddenlaunch.com"
060415 012006 29 found 2734890 raw hits
060415 012007 29 total hits: 2754135
I'm not sure why it is re-searching using a refactored query. Huh? I
don't see this behavior when there is one search server, instead of the
16 I am using now. As you can see, the query is unacceptably slow.
When I examine the search results I see many duplicate results. Looking
at it further it seems like the results of performing the same search
across all 16 nodes is being combined into one result set - duplicates
and all. I can only assume that I need to somehow partition my index or
segments, but I'm unsure how to do that.
I guess I need to take my master index and set of segments and split
them into 16 equal parts, and copy (?) those to their respective nodes.
It seems onerous and wasteful - I will be duplicating data that is
already in DFS. Am I wrong?
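For what it's worth, the bookkeeping for such a split could be as simple as round-robin assignment of segments to nodes. This is only a sketch of the idea (the node names and 16-way split are my assumptions, and there is no Nutch tool behind it); with fewer segments than nodes, some nodes simply get nothing:

```shell
# Assign each segment name to a search node, round-robin.
NODES=16
i=0
for seg in 20060406061358 20060411165547 20060412214204 20060413004057; do
  echo "node$((i % NODES)): $seg"
  i=$((i + 1))
done
```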
Thanks to anyone who read this far ;)
-Shawn
Re: Nutch shows same results multiple times.
Posted by Dima Mazmanov <nu...@proservice.ge>.
Well my script already contains this command....
> Run bin/nutch dedup segments dedup.tmp
>
>
> Dima Mazmanov wrote:
>> Hi all!! I'm running on nutch-0.7.1.
>>
>> Here is result of my search.
>>
>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
>> Site Our web site has new look and ... link on the ...
>> http://www.argosoft.org/RootPages/Default.aspx (Cached)
>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
>> Site Our web site has new look and ... link on the ...
>> http://www.argosoft.com/rootpages/Default.aspx (Cached)
>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
>> Site Our web site has new look and ... link on the ...
>> http://www.argosoft.com/RootPages/Default.aspx (Cached)
>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
>> Site Our web site has new look and ... link on the ...
>> http://www.argosoft.org/rootpages/Default.aspx (Cached)
>> As you can see one result is shown multiple times.
>> Why so? What is the difference between these links? I don't see any..
>> So, how can I avoid this problem?
>> Thanks, Regards, Dima
>>
>>
Re: Nutch shows same results multiple times.
Posted by "Håvard W. Kongsgård" <h....@niap.no>.
Run bin/nutch dedup segments dedup.tmp
Dima Mazmanov wrote:
> Hi all!! I'm running on nutch-0.7.1.
>
> Here is result of my search.
>
> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
> Site Our web site has new look and ... link on the ...
> http://www.argosoft.org/RootPages/Default.aspx (Cached)
> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
> Site Our web site has new look and ... link on the ...
> http://www.argosoft.com/rootpages/Default.aspx (Cached)
> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
> Site Our web site has new look and ... link on the ...
> http://www.argosoft.com/RootPages/Default.aspx (Cached)
> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
> Site Our web site has new look and ... link on the ...
> http://www.argosoft.org/rootpages/Default.aspx (Cached)
> As you can see one result is shown multiple times.
> Why so? What is the difference between these links? I don't see any..
> So, how can I avoid this problem?
> Thanks, Regards, Dima
>
>
Nutch shows same results multiple times.
Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi all!!
I'm running on nutch-0.7.1.
Here is result of my search.
ArGo Software Design Homepage
[html] - 30.2 k -
... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.org/RootPages/Default.aspx (Cached)
ArGo Software Design Homepage
[html] - 30.2 k -
... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.com/rootpages/Default.aspx (Cached)
ArGo Software Design Homepage
[html] - 30.2 k -
... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.com/RootPages/Default.aspx (Cached)
ArGo Software Design Homepage
[html] - 30.2 k -
... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.org/rootpages/Default.aspx (Cached)
As you can see one result is shown multiple times.
Why so?
What is the difference between these links? I don't see any..
So, how can I avoid this problem?
Thanks,
Regards, Dima
Re: Using Nutch's distributed search server mode
Posted by Ken Krugler <kk...@transpac.com>.
>Doug Cutting wrote:
>>Shawn Gervais wrote:
>>>I was not able to use the literal instructions, as my indexes and
>>>segments are in DFS while the document presumes a local filesystem
>>>installation
>>
>>Search performance is not good with DFS-based indexes & segments.
>>This is not recommended.
>
>Yeah, I figured - ignoring network overhead it seems that it would
>prevent the OS from caching disk pages, no?
>
>>Distributed search is not meant for a single merged index, but
>>rather for searching multiple indexes. With distributed search,
>>each node will typically have (a local copy of) a few segments and
>>either a merged index for just those segments, or separate indexes
>>for each segment.
>
>What is the best way to maintain an operational fetch/index and
>search cluster? It seems that it would help to have a tool that was
>able to partition existing segments and indexes and export those to
>the local filesystems of the slave nodes.
>
>Should I coordinate my fetches and indexing so that the resultant
>segments/indexes are optimal for each of my slave nodes? How do
>others handle dissimilar search slave nodes?
I'm not sure exactly what you mean by "dissimilar search slave nodes".
But I think our situation is similar. We have a cluster used for
crawling, and a cluster used for distributed searching.
We use scripts to extract groups of segments from the Hadoop DFS to a
local drive, merge/index them, then set up a distributed search
server. The appropriate size of each segment group depends on the #
of docs you want to be serving up from each search server - in our
case, I think it's about 10M or so. Obviously this varies depending
on the amount of RAM/horsepower you have on the server, and your
target query performance.
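A sketch of what such per-node scripts might emit, based on Ken's description. Every path and port here is a made-up placeholder, and the exact `bin/nutch server` invocation is an assumption rather than a verified command, so the script prints the steps instead of running them:

```shell
# Print (not execute) the per-node steps: copy a segment group out of
# DFS to local disk, build a local index for it, then serve it.
SEG=/user/nutch/segments/20060413004057
LOCAL=/d0/search
cat <<EOF
bin/hadoop dfs -copyToLocal $SEG $LOCAL/segments/
bin/nutch index $LOCAL/index crawldb linkdb $LOCAL/segments/*
bin/nutch server 9999 $LOCAL
EOF
```

Note that the glob in the heredoc is emitted literally; it would only be expanded by the shell that eventually runs the generated commands.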
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
Re: Using Nutch's distributed search server mode
Posted by Shawn Gervais <pr...@project10.net>.
Doug Cutting wrote:
> Shawn Gervais wrote:
>> I was not able to use the literal instructions, as my indexes and
>> segments are in DFS while the document presumes a local filesystem
>> installation
>
> Search performance is not good with DFS-based indexes & segments. This
> is not recommended.
Yeah, I figured - ignoring network overhead it seems that it would
prevent the OS from caching disk pages, no?
> Distributed search is not meant for a single merged index, but rather
> for searching multiple indexes. With distributed search, each node will
> typically have (a local copy of) a few segments and either a merged
> index for just those segments, or separate indexes for each segment.
What is the best way to maintain an operational fetch/index and search
cluster? It seems that it would help to have a tool that was able to
partition existing segments and indexes and export those to the local
filesystems of the slave nodes.
Should I coordinate my fetches and indexing so that the resultant
segments/indexes are optimal for each of my slave nodes? How do others
handle dissimilar search slave nodes?
Regards,
-Shawn
Re: Using Nutch's distributed search server mode
Posted by Doug Cutting <cu...@apache.org>.
Shawn Gervais wrote:
> I was not able to use the literal instructions, as my indexes and
> segments are in DFS while the document presumes a local filesystem
> installation
Search performance is not good with DFS-based indexes & segments. This
is not recommended.
Distributed search is not meant for a single merged index, but rather
for searching multiple indexes. With distributed search, each node will
typically have (a local copy of) a few segments and either a merged
index for just those segments, or separate indexes for each segment.
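For reference, the search front end typically finds its distributed search servers through a search-servers.txt file in the directory named by the searcher.dir property. A minimal sketch, assuming one server per node listening on port 9999 (the hostnames here are invented):

```
node01 9999
node02 9999
node03 9999
```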
> When I examine the search results I see many duplicate results. Looking
> at it further it seems like the results of performing the same search
> across all 16 nodes is being combined into one result set - duplicates
> and all. I can only assume that I need to somehow partition my index or
> segments, but I'm unsure how to do that.
It looks like you're searching the same dfs-resident index 16 times.
Doug