Posted to user@nutch.apache.org by Shawn Gervais <pr...@project10.net> on 2006/04/15 08:57:37 UTC
Using Nutch's distributed search server mode
Greetings list,
I am attempting to use Nutch's distributed search server mode and seeing
some unexpected results. Searches take ages to execute -- I seem to have
caused Nutch to perform the same search 16 times (I have 16 nodes).
Over the past week, I have been building my indexes:
$ bin/hadoop dfs -ls segments/
/user/nutch/segments/20060406061358 <dir> (~100k pages, first run)
/user/nutch/segments/20060411165547 <dir> (1M pages)
/user/nutch/segments/20060412214204 <dir> (2M pages)
/user/nutch/segments/20060413004057 <dir> (5M pages)
I then indexed them all into a single index. Here is my current DFS "du"
listing:
/user/nutch/crawldb 6379003931
/user/nutch/indexes 12895240115 (index built from all above segs.)
/user/nutch/indexes_old 400611137 (index built from smallest seg.)
/user/nutch/linkdb 107951330
/user/nutch/segments 67746176573
I built my big index by running a command similar to:
$ bin/nutch index indexes crawldb linkdb segments/*
However, as the DFS shell doesn't seem to support wildcards (or my own
shell was expanding them before Hadoop saw them), I was forced to
specify each segment manually.
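For anyone hitting the same wildcard problem, one workaround is to pull the segment paths out of the `-ls` listing and pass them explicitly. This is only a sketch; the exact format of the `-ls` output is assumed from the listing above:

```shell
# Simulate the `bin/hadoop dfs -ls segments/` output from the listing
# above, extract the first column (the path), and join with spaces.
LISTING='/user/nutch/segments/20060406061358 <dir>
/user/nutch/segments/20060411165547 <dir>
/user/nutch/segments/20060412214204 <dir>
/user/nutch/segments/20060413004057 <dir>'
SEGS=$(printf '%s\n' "$LISTING" | awk '{print $1}' | tr '\n' ' ')
# Print the index command rather than run it, for illustration.
echo "bin/nutch index indexes crawldb linkdb $SEGS"
```

In practice you would feed the real `bin/hadoop dfs -ls segments/` output into the same pipeline instead of the pasted listing.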
After building my index I proceeded to set up my distributed search
servers per Stefan's excellent wiki, at
http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever.
I was not able to follow the instructions literally, as my indexes and
segments are in DFS while the document presumes a local-filesystem
installation, and I was also not able to "partition" my indexes or
segments by host, since I don't know how to do that.
When I examine Tomcat's catalina.out log, as well as the logs of the
distributed search servers themselves, I see some odd behavior:
060415 011943 29 query request from 10.10.0.6
060415 011943 29 query: baby
060415 011943 29 searching for 20 raw hits
060415 011950 29 re-searching for 40 raw hits, query: baby
-site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk"
060415 011958 29 found 2741775 raw hits
060415 011958 29 re-searching for 80 raw hits, query: baby
-site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk"
-site:"ffcembroidery.com" -site:"infobluebook.com"
-site:"aaliyahlova.suddenlaunch.com"
060415 012006 29 found 2734890 raw hits
060415 012007 29 total hits: 2754135
I'm not sure why it is re-searching using a refactored query. Huh? I
don't see this behavior when there is one search server, instead of the
16 I am using now. As you can see, the query is unacceptably slow.
When I examine the search results I see many duplicate results. Looking
at it further it seems like the results of performing the same search
across all 16 nodes is being combined into one result set - duplicates
and all. I can only assume that I need to somehow partition my index or
segments, but I'm unsure how to do that.
I guess I need to take my master index and set of segments and split
them into 16 equal parts, and copy (?) those to their respective nodes.
It seems onerous and wasteful - I will be duplicating data that is
already in DFS. Am I wrong?
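For what it's worth, the bookkeeping for such a split could be as simple as round-robin assignment of segments to nodes. This is only a sketch of the idea (the node names and 16-way split are my assumptions, and there is no Nutch tool behind it); with fewer segments than nodes, some nodes simply get nothing:

```shell
# Assign each segment name to a search node, round-robin.
NODES=16
i=0
for seg in 20060406061358 20060411165547 20060412214204 20060413004057; do
  echo "node$((i % NODES)): $seg"
  i=$((i + 1))
done
```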
Thanks to anyone who read this far ;)
-Shawn
Re: Nutch shows same results multiple times.
Posted by Dima Mazmanov <nu...@proservice.ge>.
Well my script already contains this command....
> Run bin/nutch dedup segments dedup.tmp
>
>
> Dima Mazmanov wrote:
>> Hi all!! I'm running on nutch-0.7.1.
>>
>> Here is result of my search.
>>
>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
>> Site Our web site has new look and ... link on the ...
>> http://www.argosoft.org/RootPages/Default.aspx (Cached)
>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
>> Site Our web site has new look and ... link on the ...
>> http://www.argosoft.com/rootpages/Default.aspx (Cached)
>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
>> Site Our web site has new look and ... link on the ...
>> http://www.argosoft.com/RootPages/Default.aspx (Cached)
>> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
>> Site Our web site has new look and ... link on the ...
>> http://www.argosoft.org/rootpages/Default.aspx (Cached)
>> As you can see one result is shown multiple times.
>> Why so? What is the difference between these links? I don't see any..
>> So, how can I avoid this problem?
>> Thanks, Regards, Dima
>>
>>
Re: Nutch shows same results multiple times.
Posted by "Håvard W. Kongsgård" <h....@niap.no>.
Run bin/nutch dedup segments dedup.tmp
Dima Mazmanov wrote:
> Hi all!! I'm running on nutch-0.7.1.
>
> Here is result of my search.
>
> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
> Site Our web site has new look and ... link on the ...
> http://www.argosoft.org/RootPages/Default.aspx (Cached)
> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
> Site Our web site has new look and ... link on the ...
> http://www.argosoft.com/rootpages/Default.aspx (Cached)
> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
> Site Our web site has new look and ... link on the ...
> http://www.argosoft.com/RootPages/Default.aspx (Cached)
> ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web
> Site Our web site has new look and ... link on the ...
> http://www.argosoft.org/rootpages/Default.aspx (Cached)
> As you can see one result is shown multiple times.
> Why so? What is the difference between these links? I don't see any..
> So, how can I avoid this problem?
> Thanks, Regards, Dima
>
>
Nutch shows same results multiple times.
Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi all!!
I'm running on nutch-0.7.1.
Here is result of my search.
ArGo Software Design Homepage
[html] - 30.2 k -
... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.org/RootPages/Default.aspx (Cached)
ArGo Software Design Homepage
[html] - 30.2 k -
... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.com/rootpages/Default.aspx (Cached)
ArGo Software Design Homepage
[html] - 30.2 k -
... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.com/RootPages/Default.aspx (Cached)
ArGo Software Design Homepage
[html] - 30.2 k -
... Look of our Web Site Our web site has new look and ... link on the ...
http://www.argosoft.org/rootpages/Default.aspx (Cached)
As you can see one result is shown multiple times.
Why so?
What is the difference between these links? I don't see any..
So, how can I avoid this problem?
Thanks,
Regards, Dima
Re: Using Nutch's distributed search server mode
Posted by Ken Krugler <kk...@transpac.com>.
>Doug Cutting wrote:
>>Shawn Gervais wrote:
>>>I was not able to use the literal instructions, as my indexes and
>>>segments are in DFS while the document presumes a local filesystem
>>>installation
>>
>>Search performance is not good with DFS-based indexes & segments.
>>This is not recommended.
>
>Yeah, I figured - ignoring network overhead it seems that it would
>prevent the OS from caching disk pages, no?
>
>>Distributed search is not meant for a single merged index, but
>>rather for searching multiple indexes. With distributed search,
>>each node will typically have (a local copy of) a few segments and
>>either a merged index for just those segments, or separate indexes
>>for each segment.
>
>What is the best way to maintain an operational fetch/index and
>search cluster? It seems that it would help to have a tool that was
>able to partition existing segments and indexes and export those to
>the local filesystems of the slave nodes.
>
>Should I coordinate my fetches and indexing so that the resultant
>segments/indexes are optimal for each of my slave nodes? How do
>others handle dissimilar search slave nodes?
I'm not sure exactly what you mean by "dissimilar search slave nodes".
But I think our situation is similar. We have a cluster used for
crawling, and a cluster used for distributed searching.
We use scripts to extract groups of segments from the Hadoop DFS to a
local drive, merge/index them, then set up a distributed search
server. The appropriate size of each segment group depends on the #
of docs you want to be serving up from each search server - in our
case, I think it's about 10M or so. Obviously this varies depending
on the amount of RAM/horsepower you have on the server, and your
target query performance.
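A sketch of what such per-node scripts might emit, based on Ken's description. Every path and port here is a made-up placeholder, and the exact `bin/nutch server` invocation is an assumption rather than a verified command, so the script prints the steps instead of running them:

```shell
# Print (not execute) the per-node steps: copy a segment group out of
# DFS to local disk, build a local index for it, then serve it.
SEG=/user/nutch/segments/20060413004057
LOCAL=/d0/search
cat <<EOF
bin/hadoop dfs -copyToLocal $SEG $LOCAL/segments/
bin/nutch index $LOCAL/index crawldb linkdb $LOCAL/segments/*
bin/nutch server 9999 $LOCAL
EOF
```

Note that the glob in the heredoc is emitted literally; it would only be expanded by the shell that eventually runs the generated commands.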
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
Re: Using Nutch's distributed search server mode
Posted by Shawn Gervais <pr...@project10.net>.
Doug Cutting wrote:
> Shawn Gervais wrote:
>> I was not able to use the literal instructions, as my indexes and
>> segments are in DFS while the document presumes a local filesystem
>> installation
>
> Search performance is not good with DFS-based indexes & segments. This
> is not recommended.
Yeah, I figured - ignoring network overhead it seems that it would
prevent the OS from caching disk pages, no?
> Distributed search is not meant for a single merged index, but rather
> for searching multiple indexes. With distributed search, each node will
> typically have (a local copy of) a few segments and either a merged
> index for just those segments, or separate indexes for each segment.
What is the best way to maintain an operational fetch/index and search
cluster? It seems that it would help to have a tool that was able to
partition existing segments and indexes and export those to the local
filesystems of the slave nodes.
Should I coordinate my fetches and indexing so that the resultant
segments/indexes are optimal for each of my slave nodes? How do others
handle dissimilar search slave nodes?
Regards,
-Shawn
Re: Using Nutch's distributed search server mode
Posted by Doug Cutting <cu...@apache.org>.
Shawn Gervais wrote:
> I was not able to use the literal instructions, as my indexes and
> segments are in DFS while the document presumes a local filesystem
> installation
Search performance is not good with DFS-based indexes & segments. This
is not recommended.
Distributed search is not meant for a single merged index, but rather
for searching multiple indexes. With distributed search, each node will
typically have (a local copy of) a few segments and either a merged
index for just those segments, or separate indexes for each segment.
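For reference, the search front end typically finds its distributed search servers through a search-servers.txt file in the directory named by the searcher.dir property. A minimal sketch, assuming one server per node listening on port 9999 (the hostnames here are invented):

```
node01 9999
node02 9999
node03 9999
```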
> When I examine the search results I see many duplicate results. Looking
> at it further it seems like the results of performing the same search
> across all 16 nodes is being combined into one result set - duplicates
> and all. I can only assume that I need to somehow partition my index or
> segments, but I'm unsure how to do that.
It looks like you're searching the same dfs-resident index 16 times.
Doug