
Posted to user@nutch.apache.org by Jesse Hires <jh...@gmail.com> on 2009/09/23 04:59:25 UTC

splitting an index (yes, again)

My apologies in advance.

I've been digging through the mail archives searching for information on
splitting the index after crawling, but either I am getting more confused or
the information is too incomplete for a newbie like myself.

I see references to using mergesegs, but not enough to make an educated guess
(at least at my level, which I admit is low right now).

I've gotten to the point of having worked my way through the tutorial here:
http://wiki.apache.org/nutch/Nutch0.9-Hadoop0.10-Tutorial
and have a working site using a single computer. I have four more computers
to add, and would like to try distributed search.

When I read that tutorial up to the Distributed Searching portion, which is
followed by "split the index", it mentions this link:
http://wiki.apache.org/nutch/%5Bhttp%3A//www.nabble.com/Lucene-index-manipulation-tools-tf2781692.html#a7760917

But that may as well be saying "then some magic happens".

Does anyone have "step by step" instructions for splitting the index for use
in distributed search using mergesegs or otherwise? It doesn't have to have
a lot of explanation, just a list of example steps.


Mostly this is experimental for me, with no major plans other than my own
education, but because I am starting completely fresh at this, some things
are still quite confusing.

Thanks,
Jesse

Re: splitting an index (yes, again)

Posted by Jesse Hires <jh...@gmail.com>.

Perhaps I have my terminology wrong, so I am looking at this the wrong way.
If I want to distribute my search across multiple nodes, with only a
portion of the data on each node, is it just a matter of using mergesegs
to get the number and size of segments I want, rebuilding the index
(housecleaning, dedup, invert, etc.) with the new set of segments, and then
copying only a portion of the segments to each search server, along with the
whole crawldb, index, indexes, and linkdb directories?

So,
1. mergesegs ...... -split 100 (creates 20 segments)
2. index, invert, dedupe and stuff
3. scp 10 segments to node 1
4. scp 10 segments to node 2

As long as the slave nodes are configured correctly, will a search then span
both nodes?

Is the above somewhat correct?
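
To make the arithmetic in the plan above concrete, here is a minimal Python
sketch. It assumes roughly 2000 URLs in total (so that 100 URLs per segment
gives the 20 segments mentioned), uses made-up segment and node names, and
only computes which segment would be copied to which node. It does not run
any Nutch command and is not a confirmed Nutch procedure. Separately, the
mergesegs option that caps the number of URLs per output segment appears to
be spelled -slice rather than -split in Nutch's usage text, which is worth
checking against the output of bin/nutch mergesegs.

```python
import math


def plan_segments(total_urls, urls_per_segment, nodes):
    """Return {node: [segment names]} using a round-robin assignment.

    Segment names are hypothetical placeholders; real Nutch segment
    directories are named differently.
    """
    if urls_per_segment <= 0:
        raise ValueError("urls_per_segment must be positive")
    if not nodes:
        raise ValueError("at least one node is required")

    # Number of segments needed if each holds at most urls_per_segment URLs.
    segment_count = math.ceil(total_urls / urls_per_segment)
    segments = [f"segment-{i:02d}" for i in range(segment_count)]

    # Deal the segments out to the nodes one at a time.
    plan = {node: [] for node in nodes}
    for i, segment in enumerate(segments):
        plan[nodes[i % len(nodes)]].append(segment)
    return plan


# Assumed figures: 2000 URLs, 100 per segment, two search nodes.
plan = plan_segments(2000, 100, ["node1", "node2"])
for node, assigned in plan.items():
    print(node, len(assigned), "segments")
```

Round-robin keeps the node sizes within one segment of each other; the
resulting lists are what steps 3 and 4 would copy to each node.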

The split-index questions others have asked, and the fact that others are
indexing 50m+ pages across several nodes, lead me to believe there is
some sort of standard process or set of tools for distributing the index and
segments across multiple nodes.

So far, either I don't have enough understanding of the terminology to know
what to search for, or others are keeping tight-lipped about how they are
doing this.




If I do manage to get this working, with the help of others, I'd be willing
to write up a quick tutorial/faq about this to hopefully stop newbies like
me from asking this over and over again. :-)


Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Wed, Sep 23, 2009 at 5:48 AM, Jesse Hires <jh...@gmail.com> wrote:

> Exactly! Sorry for being so confusing in my original question.

Re: splitting an index (yes, again)

Posted by Jesse Hires <jh...@gmail.com>.

Exactly! Sorry for being so confusing in my original question.



Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Wed, Sep 23, 2009 at 4:45 AM, Alexander Aristov <
alexander.aristov@gmail.com> wrote:

> Ok, I will paraphrase the question.
>
> Suppose I want to run distributed search on 3 servers: one primary and
> two secondary nodes.
>
> I create a single BIG index with a distributed crawler running on other
> computers. Now I want to split this single BIG index into two parts to put
> on the search nodes.
>
> How can this be achieved?
>
> Best Regards
> Alexander Aristov

Re: splitting an index (yes, again)

Posted by Alexander Aristov <al...@gmail.com>.

Ok, I will paraphrase the question.

Suppose I want to run distributed search on 3 servers: one primary and
two secondary nodes.

I create a single BIG index with a distributed crawler running on other
computers. Now I want to split this single BIG index into two parts to put
on the search nodes.

How can this be achieved?

Best Regards
Alexander Aristov


2009/9/23 Koch Martina <Ko...@huberverlag.de>

> Hi Jesse,
>
> I'm not sure what you're trying to achieve. Do you want to use distributed
> search, or do you want to split an existing index? Neither task is a
> prerequisite for the other.
> If you want to split an index, there are several ways to do this. Which one
> to choose depends on the reason for the split.
> If you want to use distributed search, you just need two or more separate
> indexes. Start a search server for each one and configure the searcher.dir
> property in nutch-site.xml to point to the search-servers.txt file, in which
> you list the hosts and ports of your search servers (detailed description:
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg12730.html).
>
> Kind regards,
> Martina

Re: splitting an index (yes, again)

Posted by Koch Martina <Ko...@huberverlag.de>.

Hi Jesse,

I'm not sure what you're trying to achieve. Do you want to use distributed search, or do you want to split an existing index? Neither task is a prerequisite for the other.
If you want to split an index, there are several ways to do this. Which one to choose depends on the reason for the split.
If you want to use distributed search, you just need two or more separate indexes. Start a search server for each one and configure the searcher.dir property in nutch-site.xml to point to the search-servers.txt file, in which you list the hosts and ports of your search servers (detailed description: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg12730.html).
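
As a rough illustration of the configuration described above, the following Python sketch renders the two pieces of text involved. The host names, the port, and the directory path are placeholders, not values from this thread. It assumes that search-servers.txt holds one "host port" pair per line and that searcher.dir names the directory containing that file; both assumptions should be checked against the linked description before use.

```python
from xml.sax.saxutils import escape


def render_search_servers(servers):
    """Render search-servers.txt content: one 'host port' pair per line (assumed format)."""
    lines = []
    for host, port in servers:
        if not 0 < port < 65536:
            raise ValueError(f"invalid port for {host}: {port}")
        lines.append(f"{host} {port}")
    return "\n".join(lines) + "\n"


def render_searcher_dir_property(directory):
    """Render a searcher.dir <property> element for nutch-site.xml.

    Assumes searcher.dir should name the directory holding search-servers.txt.
    """
    return (
        "<property>\n"
        "  <name>searcher.dir</name>\n"
        f"  <value>{escape(directory)}</value>\n"
        "</property>\n"
    )


# Placeholder hosts, port, and path, chosen only for illustration.
servers = [("search1.example.org", 9999), ("search2.example.org", 9999)]
print(render_search_servers(servers), end="")
print(render_searcher_dir_property("/path/to/search-config"), end="")
```

The property element would go inside the existing configuration element of nutch-site.xml on the machine that runs the search front end.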

Kind regards,
Martina
