Posted to common-dev@hadoop.apache.org by Giuseppe Cannella <ca...@libero.it> on 2006/11/25 20:51:00 UTC
distributed search
in the http://wiki.apache.org/nutch/NutchHadoopTutorial page
at the 'Distributed Searching' section
I read:
"On each of the search servers you would start up the distributed search server by using the nutch server command like this:
bin/nutch server 1234 /d01/local/crawled"
but /d01/local/crawled has been created only for the first server; how can I create it on all servers?
If I use "bin/hadoop dfs -copyToLocal crawled /d01/local/" on every server, the search finds N identical results (where N is the number of servers in the cluster)
------------------------------------------------------
Re: distributed search
Posted by Dennis Kubes <nu...@dragonflymc.com>.
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> All,
>>
>> We have two versions of an index splitter. The first version
>> would run an indexing job and then using the completed index as input
>> would read the number of documents in the index and take a requested
>> split size. From this it used a custom index input format to create
>> splits according to document id. We would run a job that would map
>> out index urls as keys and documents with their ids wrapped in a
>> SerializableWritable object as the values. Then inside of a second
>> job using the index as input we would have a MapRunner that would
>> read the other supporting databases (linkdb, segments) and map all
>> objects as ObjectWritables. Then on the reduce we had a custom
>> Output and OutputFormat that took all of the objects and wrote out
>> the databases and indexes into each split.
>>
>> There was a problem with this first approach though in that writing
>> out an index from a previously serialized document would lose any
>> fields that are not stored (which is most). So we went with a second
>> approach.
>
> Yes, that's the nature of the beast - I sort of hoped that you
> implemented a true splitter, which directly splits term posting lists
> according to doc id. This is possible, and doesn't require using
> stored fields - the only problem being that someone well acquainted
> with Lucene internals needs to write it .. ;)
If you can explain more about this, we can start taking a look in this
direction. I would love to get this working with splitting by doc id.
>
>>
>> The second approach takes a number of splits and runs through an
>> indexing job on the fly. It calls the indexing and scoring filters.
>> It uses the linkdb, crawldb, and segments as input. As it indexes, it
>> also splits the databases and indexes into the number of reduce tasks
>> so that the final output is multiple splits, each holding a part of the
>> index and its supporting databases. Each of the databases holds
>> only the information for the urls that are in its part of the index.
>> These parts can then be pushed to separate search servers. This type
>> of splitting works well but you can NOT define a specific number of
>> documents or urls per split, and sometimes one split will have a lot
>> more urls than another if you
>
> Why? It depends on the partitioner, or the output format that you are
> using. E.g. in SegmentMerger I implemented an OutputFormat that
> produces several segments simultaneously, writing to their respective
> parts depending on a value of metadata. This value may be used to
> switch sequentially between output "slices" so that urls are spread
> evenly across the "slices".
We were using the default HashPartitioner with urls as keys, but yes we
are currently looking into custom Partitioners to solve the problem of
uneven distribution. Correct me if I am wrong, but I thought two machines
couldn't write to the same file at the same time. For example if
machine X is writing part-00001 then machine Y can't also write to
part-00001 in the same job.
>
>> are indexing some sites that have a lot of pages (i.e. wikipedia or
>> cnn archives). This is currently how our system works. We fetch,
>> invert links, run through some other processes, and then index and
>> split on the fly. Then we use python scripts to pull each split
>> directly from the DFS to each search server and then start the search
>> servers.
>>
>> We are still working on the splitter because the ideal approach would
>> be to be able to specify a number of documents per split as well as
>> to group by different keys, not just url. I would be happy to share
>> the current code but it is highly integrated so I would need to pull
>> it out of our code base first. It would be best if I could send it
>> to someone, say Andrzej, to take a look at first.
>
> Unless I'm missing something, SegmentMerger.SegmentOutputFormat should
> satisfy these requirements, you would just need to modify the job
> implementation ...
From what I am seeing, the SegmentMerger only handles segments. The
splitter handles the segments and linkdb, and does the indexing on the
fly. The idea being that I didn't want to have to split and then index
segments individually, and I only wanted the linkdb and segments to hold
the information they needed for each split.
Re: distributed search
Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> All,
>
> We have two versions of an index splitter. The first version
> would run an indexing job and then using the completed index as input
> would read the number of documents in the index and take a requested
> split size. From this it used a custom index input format to create
> splits according to document id. We would run a job that would map
> out index urls as keys and documents with their ids wrapped in a
> SerializableWritable object as the values. Then inside of a second
> job using the index as input we would have a MapRunner that would read
> the other supporting databases (linkdb, segments) and map all objects
> as ObjectWritables. Then on the reduce we had a custom Output and
> OutputFormat that took all of the objects and wrote out the databases
> and indexes into each split.
>
> There was a problem with this first approach though in that writing
> out an index from a previously serialized document would lose any
> fields that are not stored (which is most). So we went with a second
> approach.
Yes, that's the nature of the beast - I sort of hoped that you
implemented a true splitter, which directly splits term posting lists
according to doc id. This is possible, and doesn't require using stored
fields - the only problem being that someone well acquainted with Lucene
internals needs to write it .. ;)
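To make the doc-id idea concrete: such a splitter would carve the doc-id space into contiguous ranges and route each document's postings to the slice that owns its range. The bookkeeping half of that can be sketched in a few lines (plain Python for illustration, not Nutch or Lucene code; the hard part, actually rewriting the posting lists, is exactly what needs the internals expertise):

```python
def doc_id_ranges(num_docs, num_splits):
    """Carve [0, num_docs) into num_splits contiguous half-open ranges,
    spreading the remainder so sizes differ by at most one."""
    base, extra = divmod(num_docs, num_splits)
    ranges, start = [], 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def owning_split(doc_id, ranges):
    """Return the index of the split whose range contains doc_id."""
    for i, (lo, hi) in enumerate(ranges):
        if lo <= doc_id < hi:
            return i
    raise ValueError("doc id out of range")
```

A true splitter would then walk each term's posting list once, writing each posting (with the doc id rebased to the start of its range) into the slice returned by owning_split.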
>
> The second approach takes a number of splits and runs through an
> indexing job on the fly. It calls the indexing and scoring filters.
> It uses the linkdb, crawldb, and segments as input. As it indexes, it
> also splits the databases and indexes into the number of reduce tasks
> so that the final output is multiple splits, each holding a part of the
> index and its supporting databases. Each of the databases holds only
> the information for the urls that are in its part of the index. These
> parts can then be pushed to separate search servers. This type of
> splitting works well but you can NOT define a specific number of
> documents or urls per split, and sometimes one split will have a lot
> more urls than another if you
Why? It depends on the partitioner, or the output format that you are
using. E.g. in SegmentMerger I implemented an OutputFormat that produces
several segments simultaneously, writing to their respective parts
depending on a value of metadata. This value may be used to switch
sequentially between output "slices" so that urls are spread evenly
across the "slices".
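That sequential switching can be shown in miniature: a counter assigns record i to slice i mod N, so consecutive urls land in different slices and slice sizes differ by at most one. A toy version (plain Python, not the actual SegmentOutputFormat code):

```python
def round_robin(keys, num_slices):
    """Assign each key to an output 'slice' in rotation, like an
    OutputFormat switching sequentially between its slices."""
    slices = [[] for _ in range(num_slices)]
    for i, key in enumerate(keys):
        slices[i % num_slices].append(key)
    return slices
```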
> are indexing some sites that have a lot of pages (i.e. wikipedia or cnn
> archives). This is currently how our system works. We fetch, invert
> links, run through some other processes, and then index and split on
> the fly. Then we use python scripts to pull each split directly from
> the DFS to each search server and then start the search servers.
>
> We are still working on the splitter because the ideal approach would
> be to be able to specify a number of documents per split as well as to
> group by different keys, not just url. I would be happy to share the
> current code but it is highly integrated so I would need to pull it
> out of our code base first. It would be best if I could send it to
> someone, say Andrzej, to take a look at first.
Unless I'm missing something, SegmentMerger.SegmentOutputFormat should
satisfy these requirements, you would just need to modify the job
implementation ...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: distributed search
Posted by Chad Walters <ch...@powerset.com>.
Dennis,
You should consider splitting based on a function that will give you a
more uniform distribution (e.g. MD5(url)). That way, you should see much
less variation in the number of documents per partition.
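The point is that MD5 mixes the key bits thoroughly, so even heavily clustered urls (many pages under one host) scatter across partitions. A stdlib Python sketch of such a partition function (the real thing would be a custom Hadoop Partitioner in Java):

```python
import hashlib

def md5_partition(url, num_partitions):
    """Partition on the MD5 digest of the url rather than its raw hash,
    so pages from one large site spread over all partitions."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Use the first 8 digest bytes as an unsigned integer key.
    return int.from_bytes(digest[:8], "big") % num_partitions
```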
Chad
On 12/4/06 3:29 PM, "Dennis Kubes" <nu...@dragonflymc.com> wrote:
> All,
>
> We have two versions of an index splitter. The first version
> would run an indexing job and then using the completed index as input
> would read the number of documents in the index and take a requested
> split size. From this it used a custom index input format to create
> splits according to document id. We would run a job that would map out
> index urls as keys and documents with their ids wrapped in a
> SerializableWritable object as the values. Then inside of a second job
> using the index as input we would have a MapRunner that would read the
> other supporting databases (linkdb, segments) and map all objects as
> ObjectWritables. Then on the reduce we had a custom Output and
> OutputFormat that took all of the objects and wrote out the databases
> and indexes into each split.
>
> There was a problem with this first approach though in that writing out
> an index from a previously serialized document would lose any fields
> that are not stored (which is most). So we went with a second approach.
>
> The second approach takes a number of splits and runs through an
> indexing job on the fly. It calls the indexing and scoring filters. It
> uses the linkdb, crawldb, and segments as input. As it indexes, it also
> splits the databases and indexes into the number of reduce tasks so that
> the final output is multiple splits, each holding a part of the index and
> its supporting databases. Each of the databases holds only the
> information for the urls that are in its part of the index. These parts
> can then be pushed to separate search servers. This type of splitting
> works well but you can NOT define a specific number of documents or urls
> per split, and sometimes one split will have a lot more urls than another
> if you are indexing some sites that have a lot of pages (i.e. wikipedia
> or cnn archives). This is currently how our system works. We fetch,
> invert links, run through some other processes, and then index and split
> on the fly. Then we use python scripts to pull each split directly from
> the DFS to each search server and then start the search servers.
>
> We are still working on the splitter because the ideal approach would be
> to be able to specify a number of documents per split as well as to
> group by different keys, not just url. I would be happy to share the
> current code but it is highly integrated so I would need to pull it out
> of our code base first. It would be best if I could send it to someone,
> say Andrzej, to take a look at first.
>
> Dennis
>
> Andrzej Bialecki wrote:
>> Dennis Kubes wrote:
>>
>> [...]
>>> Having a new index on each machine and having to create separate
>>> indexes is not the most elegant way to accomplish this architecture.
>>> The best way that we have found is to have a splitter job that
>>> indexes and splits the index and
>>
>> Have you implemented a Lucene index splitter, i.e. a tool that takes
>> an existing Lucene index and splits it into parts by document id? This
>> sounds very interesting - could you tell us a bit about this?
>>
Re: distributed search
Posted by Dennis Kubes <nu...@dragonflymc.com>.
All,
We have two versions of an index splitter. The first version
would run an indexing job and then using the completed index as input
would read the number of documents in the index and take a requested
split size. From this it used a custom index input format to create
splits according to document id. We would run a job that would map out
index urls as keys and documents with their ids wrapped in a
SerializableWritable object as the values. Then inside of a second job
using the index as input we would have a MapRunner that would read the
other supporting databases (linkdb, segments) and map all objects as
ObjectWritables. Then on the reduce we had a custom Output and
OutputFormat that took all of the objects and wrote out the databases
and indexes into each split.
There was a problem with this first approach though in that writing out
an index from a previously serialized document would lose any fields
that are not stored (which is most). So we went with a second approach.
The second approach takes a number of splits and runs through an
indexing job on the fly. It calls the indexing and scoring filters. It
uses the linkdb, crawldb, and segments as input. As it indexes, it also
splits the databases and indexes into the number of reduce tasks so that
the final output is multiple splits, each holding a part of the index and
its supporting databases. Each of the databases holds only the
information for the urls that are in its part of the index. These parts
can then be pushed to separate search servers. This type of splitting
works well but you can NOT define a specific number of documents or urls
per split, and sometimes one split will have a lot more urls than another
if you are indexing some sites that have a lot of pages (i.e. wikipedia
or cnn archives). This is currently how our system works. We fetch,
invert links, run through some other processes, and then index and split
on the fly. Then we use python scripts to pull each split directly from
the DFS to each search server and then start the search servers.
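Our actual scripts aren't included here, but the pull-and-start step looks roughly like this sketch (hostnames, the local path layout, and the port are placeholders; the two commands built are the bin/hadoop dfs -copyToLocal and bin/nutch server invocations from the tutorial):

```python
def deploy_commands(splits, hosts, local_dir="/d01/local", port=1234):
    """For each search server, build the commands that pull its split
    out of the DFS and start a distributed search server on it.
    Returns (host, [command, ...]) pairs; actually executing them
    (e.g. over ssh) is left out of this sketch."""
    plan = []
    for split, host in zip(splits, hosts):
        pull = ["bin/hadoop", "dfs", "-copyToLocal", split, local_dir]
        local_split = local_dir + "/" + split.rsplit("/", 1)[-1]
        serve = ["bin/nutch", "server", str(port), local_split]
        plan.append((host, [pull, serve]))
    return plan
```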
We are still working on the splitter because the ideal approach would be
to be able to specify a number of documents per split as well as to
group by different keys, not just url. I would be happy to share the
current code but it is highly integrated so I would need to pull it out
of our code base first. It would be best if I could send it to someone,
say Andrzej, to take a look at first.
Dennis
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>
> [...]
>> Having a new index on each machine and having to create separate
>> indexes is not the most elegant way to accomplish this architecture.
>> The best way that we have found is to have a splitter job that
>> indexes and splits the index and
>
> Have you implemented a Lucene index splitter, i.e. a tool that takes
> an existing Lucene index and splits it into parts by document id? This
> sounds very interesting - could you tell us a bit about this?
>
Re: distributed search
Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
[...]
> Having a new index on each machine and having to create separate
> indexes is not the most elegant way to accomplish this architecture.
> The best way that we have found is to have a splitter job that
> indexes and splits the index and
Have you implemented a Lucene index splitter, i.e. a tool that takes an
existing Lucene index and splits it into parts by document id? This
sounds very interesting - could you tell us a bit about this?
--
Best regards,
Andrzej Bialecki <><
Re: distributed search
Posted by Dennis Kubes <nu...@dragonflymc.com>.
The distributed searching section assumes that you have split the index
into multiple pieces and that there is a piece on each machine. The
tutorial doesn't tell you how to split the indexes because there is no
tool to do that yet. I was trying to lay out a general architecture for
distributed searching instead of giving a step-by-step method. What I
would do for now is create multiple indexes of, say, 2-4 million pages
and put each index on a separate machine. You would also need to copy
all of the supporting database files, such as the crawldb and linkdb, to
each machine.
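On the query side, the distributed search client is pointed at those pieces through conf/search-servers.txt, one "host port" pair per line; with an index piece on each machine it would look something like this (hostnames and port are placeholders):

```
search1.example.com 1234
search2.example.com 1234
search3.example.com 1234
```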
Having a new index on each machine and having to create separate indexes
is not the most elegant way to accomplish this architecture. The best
way that we have found is to have a splitter job that indexes and splits
the index and supporting databases into multiple parts on the fly. Then
these parts are moved out to the search servers. We have some base code
for this but it is not in the nutch codebase as of yet. If you want to
move down this path send me an email.
Dennis
Giuseppe Cannella wrote:
> in the http://wiki.apache.org/nutch/NutchHadoopTutorial page
>
> at the 'Distributed Searching' section
> I read:
>
> "On each of the search servers you would start up the distributed search server by using the nutch server command like this:
> bin/nutch server 1234 /d01/local/crawled"
>
> but /d01/local/crawled has been created only for the first server; how can I create it on all servers?
> If I use "bin/hadoop dfs -copyToLocal crawled /d01/local/" on every server, the search finds N identical results (where N is the number of servers in the cluster)