Posted to user@nutch.apache.org by charlie w <sp...@gmail.com> on 2007/07/31 19:06:14 UTC
spliting an index
With regard to distributed search I see lots of discussion about splitting
the index, but no actual discussion about specifically how that's done.
I have a small, but growing, index. Is it possible to split my existing
index, and if so, how?
Re: spliting an index
Posted by Dennis Kubes <ku...@apache.org>.
Short answer: no way to do it currently. Now for the long answer.
You can handle searching in two ways:
1) Have a single massive index and set of segments: merge everything,
including segments and indexes. Then split the indexes and segments
(don't forget to split the segments too, otherwise every distributed
search server machine would hold the content from all segments), and put
an index of a given size/number of URLs on each search server.
This doesn't work for a couple of different reasons. First, once the
data gets too big you are going to be spending a massive amount of time
merging, sorting, and splitting. Second, to split an already created
index you would need a program (and programmer) that understood the
internals of Lucene serialization, because non-stored fields don't
survive serialization, so copying Lucene documents from one index to
another won't work.
2) Crawl x number of URLs, say 10 million. Process those URLs,
including updatedb and any global processing such as LinkRank. While
you would have different segments and indexes (shards), you would
probably have a single master crawldb. Index just those crawled
segments. No merging takes place. Deploy that index and its segments
out to a given search server. Rinse, lather, repeat with as many search
servers as you have.
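The batching step above can be sketched in plain Java. This is a toy illustration, not Nutch code: the class and method names (`ShardPlanner`, `planShards`) are made up, and real shards would be built by indexing the crawled segments, not by slicing a URL list in memory.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch of option 2: partition a crawl list into consecutive
 *  fixed-size batches; batch i becomes the shard deployed to search
 *  server i. No merging across batches ever happens. */
public class ShardPlanner {
    static List<List<String>> planShards(List<String> urls, int batchSize) {
        List<List<String>> shards = new ArrayList<>();
        for (int start = 0; start < urls.size(); start += batchSize) {
            int end = Math.min(start + batchSize, urls.size());
            // Copy the slice so each shard is independent of the master list.
            shards.add(new ArrayList<>(urls.subList(start, end)));
        }
        return shards;
    }
}
```

With a batch size of 10 million, five crawled batches simply become five independent shards, one per search server.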
Some caveats. Crawl the highest-scoring URLs first, then move down to
the lower and lowest. For example, the first 10M are the best scoring
according to LinkRank, the next 10M are a little lower scoring, etc. As
you move down, the crawl will take longer and longer because you will
have more "bad" URLs. On some crawls we have seen as many as 50% bad
URLs.
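The "best first" ordering amounts to sorting the candidate URLs by score before batching. A minimal sketch, again with made-up names (`ScoredUrl`, `orderByScore`); in a real deployment the scores would come from LinkRank via the crawldb, not from an in-memory list:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch: order crawl candidates by descending score so the first
 *  batch (and therefore the first shard) holds the best-scoring URLs. */
public class ScoreOrderedBatcher {
    record ScoredUrl(String url, float score) {}

    static List<String> orderByScore(List<ScoredUrl> candidates) {
        List<ScoredUrl> copy = new ArrayList<>(candidates);
        // Highest score first.
        copy.sort((a, b) -> Float.compare(b.score(), a.score()));
        List<String> ordered = new ArrayList<>();
        for (ScoredUrl u : copy) ordered.add(u.url());
        return ordered;
    }
}
```

Feeding this ordered list into the batch partitioning above puts the top 10M URLs in shard 0, the next 10M in shard 1, and so on.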
Once you get to the number of search servers you have, you reset the
crawldb crawled times (not scores) and start the crawl over completely.
When each search shard (indexes and segments) is completed and ready to
be deployed, you bring down that single search server, do the
deployment, and bring it back up. The entire search always stays up;
you are just continuously replacing shards in the background.
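The rolling replacement described above can be simulated to show the key invariant: at most one search server is ever down, so the overall search service never goes dark. A toy sketch (hypothetical names, not Nutch code):

```java
import java.util.ArrayList;
import java.util.List;

/** Toy simulation of rolling shard replacement: each server is stopped,
 *  given its new shard (index + segments), and restarted before the
 *  next server is touched. */
public class RollingDeploy {
    static List<String> deployAll(int numServers) {
        List<String> log = new ArrayList<>();
        for (int i = 0; i < numServers; i++) {
            log.add("stop server " + i);
            log.add("copy shard to server " + i);
            log.add("start server " + i);
        }
        return log;
    }

    /** How many servers were ever down at the same time. */
    static int maxSimultaneouslyDown(List<String> log) {
        int down = 0, max = 0;
        for (String event : log) {
            if (event.startsWith("stop")) { down++; max = Math.max(max, down); }
            else if (event.startsWith("start")) down--;
        }
        return max;
    }
}
```

Because the loop restarts each server before stopping the next, `maxSimultaneouslyDown` is 1 regardless of cluster size, which is exactly why the search as a whole stays up.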
Dennis
Re: spliting an index
Posted by lei wang <nu...@gmail.com>.
I agree with you that we should split up the index at the indexing
stage. We are thinking along the same lines. Maybe we can read the
index directory and segments directory through the Nutch API, split the
segments directory by documents, and build an index on each segment
file? Nutch claims to be a distributed search engine, but it is too
clumsy to deploy a large-scale index distributed.
Re: spliting an index
Posted by Alexander Aristov <al...@gmail.com>.
Off the top of my head, I am afraid there is no direct solution. The
index can be split at the indexing stage by setting the parameter for
the number of URLs in an index, but you also have segments, which store
page content, and you cannot alter them after you have done indexing.
The solution might be to split up the segments first and then index
each part separately, but this approach would require a certain amount
of development.
Correct me if I am wrong.
Re: spliting an index
Posted by beyiwork <nu...@gmail.com>.
I am considering this problem now; can anyone help?
--
View this message in context: http://www.nabble.com/spliting-an-index-tp11928411p24066856.html
Sent from the Nutch - User mailing list archive at Nabble.com.