Posted to user@nutch.apache.org by Dennis Kubes <ku...@apache.org> on 2007/08/01 01:52:21 UTC

Re: Nutch and distributed searching (w/ apologies)

It is not a problem to contact me directly if you have questions. I am 
going to include this post on the mailing list as well in case other 
people have similar questions.

When we originally started (and back when I wrote the tutorial), I 
thought the best approach would be to have a single massive set of 
segments, crawldb, linkdb, and indexes on the dfs.  If we had that, we 
would then need an index splitter to split those massive databases so 
that each search server held x number of urls.  The problem with this 
approach is that it doesn't scale very well (beyond about 50M pages).  
You have to keep merging whatever you are crawling into your master, 
and after a while it takes a good deal of time to sort, merge, and 
continually re-index.

The approach we are using these days is focused on smaller distributed 
segments and hence indexes.  Here is how it works:

1) Inject your database with a beginning url list and fetch those pages.
2) Update a single master crawl db (at this point you only have one).
3) Do a generate with a -topN option to get the best urls to fetch.  Do 
this for the number of urls you want on each search server.  A good rule 
of thumb is no more than 2-3 million pages per disk for searching (this 
is for web search engines).  So let's say your crawldb, once updated from 
the first run, has > 2 million urls; you would do a generate with -topN 
2000000.
4) Fetch this new segment through the fetch command.
5) Update the single master crawldb with this new segment.
6) Create a single master linkdb (at this point you will only have one) 
through the invertlinks command.
7) Index that single fetched segment.
8) Use a script, etc. to push the single index, segment, and linkdb from 
the dfs out to a directory on the search server.
9) Do steps 3-8 for as many search servers as you have.  When you reach 
the number of search servers you have, you can replace the indexes, etc. 
on the first, second, etc. search servers with new fetch cycles.  This 
way your index always has the best pages for the number of servers and 
amount of space you have.  (A rough script for one pass of this cycle is 
sketched below.)
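
To make the flow concrete, here is a rough Python sketch of one pass 
through steps 3-8.  It is only an illustration (not the automation 
script mentioned later in this thread): the bin/nutch sub-commands are 
the usual Nutch 0.9 ones, but every path, the topN value, the staging 
directory, the rsync deploy step, and the search server hostname are 
placeholder assumptions you would adjust for your own cluster, and the 
dfs -ls parsing assumes the old Hadoop output format.

#!/usr/bin/env python
# Sketch of one fetch-index-deploy pass (steps 3-8).  Placeholder paths,
# hostnames, and deploy mechanics -- adjust for your own setup.

import os
import subprocess

NUTCH    = "bin/nutch"
HADOOP   = "bin/hadoop"
CRAWLDB  = "crawl/crawldb"       # single master crawldb on the dfs
LINKDB   = "crawl/linkdb"        # single master linkdb on the dfs
SEGMENTS = "crawl/segments"
TOPN     = "2000000"             # roughly 2M pages per search server
STAGING  = "/tmp/deploy"         # local staging dir before rsync

def run(*args):
    """Run a command, failing loudly on a non-zero exit code."""
    subprocess.check_call(list(args))

def latest_segment():
    """Return the dfs path of the newest segment.  Segments are named by
    timestamp, so the last one in sorted order is the one just generated."""
    out = subprocess.Popen([HADOOP, "dfs", "-ls", SEGMENTS],
                           stdout=subprocess.PIPE).communicate()[0].decode()
    paths = []
    for line in out.splitlines():
        fields = line.split()
        if fields and "/segments/" in fields[0]:
            paths.append(fields[0])
    return sorted(paths)[-1]

def cycle(search_server):
    # 3) generate the next best topN urls from the master crawldb
    run(NUTCH, "generate", CRAWLDB, SEGMENTS, "-topN", TOPN)
    segment = latest_segment()
    # 4) fetch the new segment
    run(NUTCH, "fetch", segment)
    # 5) update the single master crawldb from this segment
    run(NUTCH, "updatedb", CRAWLDB, segment)
    # 6) add this segment's links to the master linkdb (see the note on
    #    per-segment linkdbs and mergelinkdb below)
    run(NUTCH, "invertlinks", LINKDB, segment)
    # 7) index just this segment against the master crawldb and linkdb
    index = segment + "-index"
    run(NUTCH, "index", index, CRAWLDB, LINKDB, segment)
    # 8) pull the pieces out of the dfs and push them to the search server
    if not os.path.isdir(STAGING):
        os.makedirs(STAGING)
    for piece in (index, segment, LINKDB):
        run(HADOOP, "dfs", "-copyToLocal", piece, STAGING)
    run("rsync", "-a", STAGING + "/", search_server + ":/search/crawl/")

if __name__ == "__main__":
    cycle("search1.example.com")   # step 9: repeat once per search server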

Once you have a linkdb created, meaning on the second or later fetch, 
you would create a linkdb for just the single segment and then use the 
mergelinkdb command to merge that single linkdb into the master linkdb.
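
A minimal sketch of that per-segment step, again with Nutch 0.9 command 
names and placeholder paths (mergelinkdb writes to a fresh output 
directory, so the merged result is swapped into place afterwards; check 
the usage messages for your own version):

import subprocess

segment    = "crawl/segments/20070801123456"            # segment just fetched (placeholder)
seg_linkdb = "crawl/linkdb-" + segment.split("/")[-1]   # linkdb for only this segment

# invert links from just the one segment into its own small linkdb
subprocess.check_call(["bin/nutch", "invertlinks", seg_linkdb, segment])
# merge the small linkdb with the master into a new linkdb ...
subprocess.check_call(["bin/nutch", "mergelinkdb", "crawl/linkdb-merged",
                       "crawl/linkdb", seg_linkdb])
# ... and swap the merged result into place as the new master
subprocess.check_call(["bin/hadoop", "dfs", "-rmr", "crawl/linkdb"])
subprocess.check_call(["bin/hadoop", "dfs", "-mv", "crawl/linkdb-merged", "crawl/linkdb"])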

When pushing the pieces to search servers you can move the entire 
linkdb, but after a while that is going to get big.  A better way is to 
write a map reduce job that splits the linkdb to include only the urls 
for the single segment you have fetched.  Then you would move only that 
single linkdb piece out, not the entire master linkdb.  If you want to 
get started quickly, though, just copy the entire linkdb to each search 
server.

This approach assumes that you have a search website fronting multiple 
search servers (listed in search-servers.txt) and that you can bring 
down a single search server, update the index and pieces, and then bring 
that search server back up.  This way the entire index is never down.
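
For reference, the web front end finds its back ends through a plain 
search-servers.txt file in its conf directory, one "host port" pair per 
line, and each back end is just a bin/nutch server process pointed at 
the deployed directory (something like: bin/nutch server 9999 
/search/crawl).  The hosts, port, and paths here are only placeholders:

search1.example.com 9999
search2.example.com 9999
search3.example.com 9999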

Hope this helps and let me know if you have any questions.

Dennis Kubes

Re: Nutch and distributed searching (w/ apologies)

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 8/1/07, Dennis Kubes <ku...@apache.org> wrote:
> I am currently writing a python script to automate this whole process
> from inject to pushing out to search servers.  It should be done in a
> day or two and I will post it on the wiki.

(It is a bit of shameless self-promotion, but here goes:)

I hope NUTCH-442 will help with managing indexes.  The patch at 
NUTCH-442 lets Solr servers return search results (it also lets you 
launch separate segment servers that return summaries), so you can run 
Solr servers instead of Nutch's index servers.  Since you can update 
Solr online, you can 'switch' to new segments without any downtime.

(This is not completely true: segment servers do not yet pick up new 
segments without a restart, but that should be easy to fix.)



-- 
Doğacan Güney

Re: Nutch and distributed searching (w/ apologies)

Posted by charlie w <sp...@gmail.com>.
Ah, OK, I get it.  Sadly for me, this precise approach is probably not 
going to meet my requirements, but it really helps to get me going, and I 
think a variation on it will suit me quite well.  I'm very much looking 
forward to seeing the script that automates this.

I have one minor quibble with this:


> And yes you may have some duplicates in your indexes but this is taken
> care of in the search itself (there is a dedupField option in
> NutchBean).  Of the duplicates the one with the best score (most
> relevant) should be returned.


If you truly have two versions of the same page (same URL), I can imagine a
scenario where you don't necessarily want the one with the highest score.
If the content has changed, you want the one that was most recently
fetched.  You want the best chance of showing an excerpt from the current
page and scoring the current content against other pages that are also hits.

Many thanks for all your help; it clears up a lot for me.

- Charlie

Re: Nutch and distributed searching (w/ apologies)

Posted by Dennis Kubes <ku...@apache.org>.
Actually no.  Let's say you have 10 machines and hence 10 search 
servers.  You would run through 10 iterations of fetch-index-deploy, one 
to each machine.  Let's say you have 3 million pages per machine, so this 
whole system could support a 30 million page index.

Once you deploy to all 10 you would want to start over, as you don't 
have any more space (machines, etc.).  So you would reset the crawldb 
(this is a special job that simply makes sure that all pages are 
available for fetching and are not filtered out by their next crawl 
date).  Then you would run the next generate with -topN, which would 
grab the next top 3 million urls to be fetched.  This fetch-index-deploy 
cycle would then replace (not overwrite) the deployment on search server 
1, then 2, 3, ... as you do more cycles.  This way the best urls would 
continually rise to the top.

One point: there is no concept of depth, only of top urls to fetch. 
With each cycle we update a single master crawldb, so the top urls will 
continually change.  But we are not fetching levels as in the whole-web 
crawl tutorial.  While going through the cycle we don't reset the 
crawldb, so any pages we have fetched during the run of machines won't 
get fetched again until we reset the crawldb after all machines have 
been deployed and we start the whole cycle over again.
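
A sketch of that outer rotation, in the same spirit as the earlier one 
(run_cycle() and reset_crawldb() are placeholder names for the 
fetch-index-deploy pass and the crawldb "reset" job described above, 
not real Nutch commands):

# Placeholder outer loop: one fetch-index-deploy pass per search server,
# then reset the master crawldb and start the rotation over.

SEARCH_SERVERS = ["search%d.example.com" % i for i in range(1, 11)]  # 10 machines

def run_cycle(server):
    """Placeholder for one generate -topN / fetch / updatedb / invertlinks /
    index / deploy pass (see the sketch in the first message)."""
    pass

def reset_crawldb():
    """Placeholder for the special job that clears next-fetch times so every
    url in the master crawldb is eligible to be generated again."""
    pass

def crawl_forever():
    while True:                          # continuous until you stop it
        for server in SEARCH_SERVERS:
            run_cycle(server)            # replaces the deployment on this server
        reset_crawldb()                  # all servers refreshed; start over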

And yes you may have some duplicates in your indexes but this is taken 
care of in the search itself (there is a dedupField option in 
NutchBean).  Of the duplicates the one with the best score (most 
relevant) should be returned.

This whole process is continuous and would just keep running until you 
tell it to stop.  The search would never be fully down as only a single 
search server would be down at once and only for a few seconds while the 
database files are replaced.  And you would continually get the best 
urls in your index for the space you have.  I imagine that this is very 
similar to how the google dance works.

Dennis Kubes


Re: Nutch and distributed searching (w/ apologies)

Posted by charlie w <sp...@gmail.com>.
On 8/1/07, Dennis Kubes <ku...@apache.org> wrote:
>
> I am currently writing a python script to automate this whole process
> from inject to pushing out to search servers.  It should be done in a
> day or two and I will post it on the wiki.


I'm very much looking forward to this.  Reading the code always helps make
it concrete to me.

> You can do a dedup of results in the search itself.  So yes, there are
> duplicates in the different index segments, but you will always be
> returning the "best" pages to the user.


I get it; so dedup based on the timestamp of each version of the document
with a particular URL that was a hit.

>
> > It also seems that I must be missing something regarding new pages.  If,
> as
> > in step 9, you are replacing the index on a search server, wouldn't you
> > possibly create the effect of removing documents from the index?  Say
> you
> > have the same 2 search servers, but do 10 iterations of fetching as a
> > "depth" of crawl.  Wouldn't you be replacing the documents in search
> server
> > 1 several times over the course of those 10 iterations?
>
> No because you are updating a single master crawldb and on the next
> iteration it wouldn't grab the same pages, it would grab the next best n
> pages.


I had the impression you were overwriting the index on the search servers
with the segment and index from the new iteration of fetching.  Meaning in
my 2 search server example, iteration 3 of fetching would overwrite
the index built by iteration 1 of fetching (they'd both wind up on search
server 1).  But instead, you're actually merging the results of iteration 3
into the search server's existing index from iteration 1, rather than
replacing the entire index?

- C

Re: Nutch and distributed searching (w/ apologies)

Posted by Dennis Kubes <ku...@apache.org>.
I am currently writing a python script to automate this whole process 
from inject to pushing out to search servers.  It should be done in a 
day or two and I will post it on the wiki.

Dennis Kubes

charlie w wrote:
> Thanks very much for the extended reply; lots of food for thought.
> 
> WRT the merge/index time on a large index, I kind of suspected this might be
> the case.  It's already taking a bit of time (albeit on a weak box) with my
> relatively small index.  In general the approach you outline sounds like
> something I intuitively thought might need to be done, but had no
> real experience to justify that intuition.
> 
> So if I understand you correctly, each iteration of fetching winds up on a
> separate search server, and you're not doing any merging of segments?
> 
> When you eventually get around to recrawling a particular page, do you wind
> up with problems if that page exists in two separate indexes on two separate
> search servers?  For example, we fetch www.foo.com, and that page goes into
> the index on search server 1.  Then, 35 days later, we go back to crawl
> www.foo.com, and this time it winds up in the index on search server 2.
> Wouldn't the two search servers return the same page as a hit to a search?
> If not, what prevents that from being an issue?

You can do a dedup of results in the search itself.  So yes, there are 
duplicates in the different index segments, but you will always be 
returning the "best" pages to the user.
> 
> It also seems that I must be missing something regarding new pages.  If, as
> in step 9, you are replacing the index on a search server, wouldn't you
> possibly create the effect of removing documents from the index?  Say you
> have the same 2 search servers, but do 10 iterations of fetching as a
> "depth" of crawl.  Wouldn't you be replacing the documents in search server
> 1 several times over the course of those 10 iterations?

No because you are updating a single master crawldb and on the next 
iteration it wouldn't grab the same pages, it would grab the next best n 
pages.


Re: Nutch and distributed searching (w/ apologies)

Posted by charlie w <sp...@gmail.com>.
Thanks very much for the extended reply; lots of food for thought.

WRT the merge/index time on a large index, I kind of suspected this might be
the case.  It's already taking a bit of time (albeit on a weak box) with my
relatively small index.  In general the approach you outline sounds like
something I intuitively thought might need to be done, but had no
real experience to justify that intuition.

So if I understand you correctly, each iteration of fetching winds up on a
separate search server, and you're not doing any merging of segments?

When you eventually get around to recrawling a particular page, do you wind
up with problems if that page exists in two separate indexes on two separate
search servers?  For example, we fetch www.foo.com, and that page goes into
the index on search server 1.  Then, 35 days later, we go back to crawl
www.foo.com, and this time it winds up in the index on search server 2.
Wouldn't the two search servers return the same page as a hit to a search?
If not, what prevents that from being an issue?

It also seems that I must be missing something regarding new pages.  If, as
in step 9, you are replacing the index on a search server, wouldn't you
possibly create the effect of removing documents from the index?  Say you
have the same 2 search servers, but do 10 iterations of fetching as a
"depth" of crawl.  Wouldn't you be replacing the documents in search server
1 several times over the course of those 10 iterations?

Once again, thanks.

- Charlie

