Posted to user@nutch.apache.org by charlie w <sp...@gmail.com> on 2007/07/31 19:06:14 UTC

splitting an index

With regard to distributed search I see lots of discussion about splitting
the index, but no actual discussion about specifically how that's done.

I have a small, but growing, index.  Is it possible to split my existing
index, and if so, how?

Re: splitting an index

Posted by Dennis Kubes <ku...@apache.org>.
Short answer: no way to do it currently.  Now for the long answer.

You can handle searching in two ways:

1) Have a single massive index and set of segments: merge everything, 
including segments and indexes.  Then split the indexes and segments 
(don't forget to split the segments too, otherwise you would have the 
content from all segments on every distributed search server machine), 
and put an index of a given size/number of urls on each search server.

This doesn't work for a couple of different reasons.  First, once the 
data gets too big you are going to spend a massive amount of time 
merging, sorting, and splitting.  Second, to split an already created 
index you would need a program (and a programmer) that understood the 
internals of Lucene serialization, because non-stored fields don't 
survive serialization, and so copying Lucene documents from one index 
to another won't work.
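
A quick way to see the second point for yourself is a minimal sketch 
against a plain Lucene 2.x index (the "crawl/index" and "split/*" paths 
and the even/odd split are made up for illustration; nothing here is 
Nutch-specific):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

// Naive index "splitter": copy documents from an existing index into two
// new ones.  It runs, but reader.document(i) returns STORED fields only,
// so anything indexed without being stored (most of what Nutch indexes,
// e.g. the tokenized content) is silently dropped and the halves are not
// searchable on those fields.
public class NaiveIndexSplitter {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("crawl/index");
    IndexWriter left  = new IndexWriter("split/part-0", new StandardAnalyzer(), true);
    IndexWriter right = new IndexWriter("split/part-1", new StandardAnalyzer(), true);
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;            // skip deleted docs
      Document doc = reader.document(i);            // stored fields only!
      (i % 2 == 0 ? left : right).addDocument(doc); // round-robin split
    }
    left.close();
    right.close();
    reader.close();
  }
}

Merging goes the other way (Lucene appends whole index segments via 
IndexWriter.addIndexes), which is why merging indexes is straightforward 
and splitting them after the fact is not.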

2) Crawl x number of urls, say 10 million.  Process those urls, 
including updatedb and any global processing such as LinkRank.  While 
you would have different segments and indexes (shards), you would 
probably have a single master crawldb.  Index just those crawled 
segments.  No merging takes place.  Deploy that index and its segments 
out to a given search server.  Rinse, lather, repeat with as many search 
servers as you have.
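
As a rough sketch of what one such shard-building cycle could look like 
when driven from Java: the classes below are the Nutch 0.9/1.0-era 
command-line tools (Generator, Fetcher, CrawlDb, LinkDb, the old 
Lucene-based Indexer) run through Hadoop's ToolRunner, and the paths and 
the 10M topN are made up for the example, so check the usage output of 
your own version before trusting the exact arguments.

import java.io.File;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.LinkDb;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.indexer.Indexer;
import org.apache.nutch.util.NutchConfiguration;

// Build one shard: generate the top-N urls from the shared crawldb, fetch
// them into a fresh segment, fold the results back into the crawldb,
// invert links, and index ONLY that segment.  The resulting index plus
// segment is the shard that gets pushed to one search server.
public class BuildOneShard {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    String crawldb = "crawl/crawldb", linkdb = "crawl/linkdb";
    String segments = "crawl/segments";

    // generate: the 10M best-scoring urls still due for fetching
    ToolRunner.run(conf, new Generator(),
        new String[] { crawldb, segments, "-topN", "10000000" });

    // the generator names the segment with a timestamp; take the newest
    // (plain java.io.File, so this assumes a local, non-DFS crawl dir)
    String[] segs = new File(segments).list();
    Arrays.sort(segs);
    String segment = segments + "/" + segs[segs.length - 1];

    ToolRunner.run(conf, new Fetcher(), new String[] { segment });
    // if fetcher.parse is disabled in your config, run the ParseSegment
    // tool on the segment before the next two steps
    ToolRunner.run(conf, new CrawlDb(), new String[] { crawldb, segment });  // updatedb
    ToolRunner.run(conf, new LinkDb(), new String[] { linkdb, segment });    // invertlinks
    ToolRunner.run(conf, new Indexer(),
        new String[] { "crawl/index-shard1", crawldb, linkdb, segment });    // index only this segment
  }
}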

Some caveats.  Crawl the highest scoring urls first, then move down to 
the lower and lowest.  For example, the first 10M are the best scoring 
according to LinkRank, the next 10M score a little lower, etc.  As you 
move down, the crawl will take longer and longer because you will hit 
more "bad" urls.  On some crawls we have seen as high as 50% bad urls.

Once you reach the number of search servers you have, you reset the 
crawldb crawled times (not the scores) and start the crawl over 
completely.  When each search shard (index and segments) is completed 
and ready to be deployed, you bring down that single search server, do 
the deployment, and bring it back up.  The search as a whole always 
stays up; you are just replacing shards continuously in the background.

Dennis



charlie w wrote:
> With regard to distributed search I see lots of discussion about splitting
> the index, but no actual discussion about specifically how that's done.
> 
> I have a small, but growing, index.  Is it possible to split my existing
> index, and if so, how?
> 

Re: splitting an index

Posted by lei wang <nu...@gmail.com>.
I agree with you that we should split up the index at the indexing
stage; we are thinking along the same lines.  Maybe we can read the
index directory and the segments directory through the Nutch API, split
the segments directory by documents, and build an index on each split
of the segments?  Nutch claims to be a distributed search engine, but it
is too clumsy to deploy a large-scale index in a distributed way.
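
One thing worth checking in your Nutch version: if I remember right, the
SegmentMerger tool (bin/nutch mergesegs) takes a -slice option that
rewrites existing segments into output segments of a fixed number of
urls, which is more or less the "split the segments dir by documents"
step; each slice could then be indexed on its own.  Untested sketch, with
made-up paths and an arbitrary slice size, so check your version's
SegmentMerger usage string first:

import org.apache.nutch.segment.SegmentMerger;

// Slice all existing segments into chunks of at most 1,000,000 urls.
// Each output segment under crawl/segments-sliced can then be indexed
// separately, one index per future search server.
public class SliceSegments {
  public static void main(String[] args) throws Exception {
    SegmentMerger.main(new String[] {
        "crawl/segments-sliced",   // output dir for the sliced segments
        "-dir", "crawl/segments",  // use every segment under this dir as input
        "-slice", "1000000"        // max urls per output segment
    });
  }
}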

On Wed, Jun 17, 2009 at 1:17 PM, Alexander Aristov <
alexander.aristov@gmail.com> wrote:

> Off the top of my head, I am afraid there is no direct solution.  The
> index can be split at the indexing stage by setting the parameter for
> the number of URLs per index, but you also have the segments, which
> store the page content, and you cannot alter them after indexing is
> done.
>
> Probably the solution might be to split up the segments first and then
> index each part separately, but this approach would require a certain
> amount of development.
>
> Correct me if I am wrong.
>
> 2009/6/17 beyiwork <nu...@gmail.com>
>
> >
> > I am considering this problem now; can anyone help?
> >
> > charlie w wrote:
> > >
> > > With regard to distributed search I see lots of discussion about
> > splitting
> > > the index, but no actual discussion about specifically how that's done.
> > >
> > > I have a small, but growing, index.  Is it possible to split my
> existing
> > > index, and if so, how?
> > >
> > >
> >
> >
> >
>

Re: splitting an index

Posted by Alexander Aristov <al...@gmail.com>.
Off the top of my head, I am afraid there is no direct solution.  The
index can be split at the indexing stage by setting the parameter for
the number of URLs per index, but you also have the segments, which
store the page content, and you cannot alter them after indexing is
done.

Probably the solution might be to split up the segments first and then
index each part separately, but this approach would require a certain
amount of development.

Correct me if I am wrong.

2009/6/17 beyiwork <nu...@gmail.com>

>
> I am considering this problem now; can anyone help?
>
> charlie w wrote:
> >
> > With regard to distributed search I see lots of discussion about
> splitting
> > the index, but no actual discussion about specifically how that's done.
> >
> > I have a small, but growing, index.  Is it possible to split my existing
> > index, and if so, how?
> >
> >
>
>
>

Re: splitting an index

Posted by beyiwork <nu...@gmail.com>.
I am considering this problem now; can anyone help?

charlie w wrote:
> 
> With regard to distributed search I see lots of discussion about splitting
> the index, but no actual discussion about specifically how that's done.
> 
> I have a small, but growing, index.  Is it possible to split my existing
> index, and if so, how?
> 
> 
