Posted to common-dev@hadoop.apache.org by Giuseppe Cannella <ca...@libero.it> on 2006/11/25 20:51:00 UTC

distributed search

On the http://wiki.apache.org/nutch/NutchHadoopTutorial page,

in the 'Distributed Searching' section,
I read:

"On each of the search servers you would start up the distributed search server by using the nutch server command like this:
bin/nutch server 1234 /d01/local/crawled"

But /d01/local/crawled has been created only for the first server; how can I create it for all servers?
If I use "bin/hadoop dfs -copyToLocal crawled /d01/local/" on every server, the search finds N identical results (where N is the number of servers in the cluster).





Re: distributed search

Posted by Dennis Kubes <nu...@dragonflymc.com>.

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> All,
>>
>> We have two versions of a type of index splitter.  The first version
>> would run an indexing job and then, using the completed index as input,
>> would read the number of documents in the index and take a requested
>> split size.  From this it used a custom index input format to create
>> splits according to document id.  We would run a job that would map
>> out index urls as keys and, as values, documents with their ids
>> wrapped in a SerializableWritable object.  Then inside of a second
>> job using the index as input we would have a MapRunner that would
>> read the other supporting databases (linkdb, segments) and map all
>> objects as ObjectWritables.  Then on the reduce we had a custom
>> Output and OutputFormat that took all of the objects and wrote out
>> the databases and indexes into each split.
>>
>> There was a problem with this first approach, though, in that writing
>> out an index from a previously serialized document would lose any
>> fields that are not stored (which is most of them).  So we went with a
>> second approach.
>
> Yes, that's the nature of the beast - I sort of hoped that you had
> implemented a true splitter, which directly splits term posting lists
> according to doc id. This is possible, and doesn't require using
> stored fields - the only problem being that someone well acquainted
> with Lucene internals needs to write it... ;)
If you can explain more about this, we can start taking a look in this 
direction.  I would love to get this working by doc id.
>
>>
>> The second approach takes a number of splits and runs through an
>> indexing job on the fly.  It calls the indexing and scoring filters.
>> It uses the linkdb, crawldb, and segments as input.  As it indexes it
>> also splits the databases and indexes into the number of reduce tasks,
>> so that the final output is multiple splits, each holding a part of the
>> index and its supporting databases.  Each of the databases holds
>> only the information for the urls that are in its part of the index.
>> These parts can then be pushed to separate search servers.  This type
>> of splitting works well but you can NOT define a specific number of
>> documents or urls per split, and sometimes one split will have a lot
>> more urls than another if you
>
> Why? It depends on the partitioner or the output format that you are
> using. E.g., in SegmentMerger I implemented an OutputFormat that
> produces several segments simultaneously, writing to their respective
> parts depending on a metadata value. This value may be used to
> switch sequentially between output "slices" so that urls are spread
> evenly across the slices.
We were using the default HashPartitioner with urls as keys, but yes, we 
are currently looking into custom Partitioners to solve the problem of an 
uneven distribution.  Correct me if I am wrong, but I thought two machines 
couldn't write to the same file at the same time.  For example, if 
machine X is writing part-00001 then machine Y can't also write to 
part-00001 in the same job.
>
>> are indexing some sites that have a lot of pages (e.g. wikipedia or 
>> cnn archives).  This is currently how our system works.  We fetch, 
>> invert links, run through some other processes, and then index and 
>> split on the fly.  Then we use python scripts to pull each split 
>> directly from the DFS to each search server and then start the search 
>> servers.
>>
>> We are still working on the splitter because the ideal approach would 
>> be to be able to specify a number of documents per split as well as 
>> to group by different keys, not just url.  I would be happy to share 
>> the current code but it is highly integrated so I would need to pull 
>> it out of our code base first.  It would be best if I could send it 
>> to someone, say Andrzej, to take a look at first.
>
> Unless I'm missing something, SegmentMerger.SegmentOutputFormat should 
> satisfy these requirements, you would just need to modify the job 
> implementation ...
 From what I am seeing, SegmentMerger only handles segments.  The 
splitter handles segments and the linkdb, and does the indexing on the 
fly.  The idea being that I didn't want to have to split and then index 
segments individually, and I only wanted the linkdb and segments to hold 
the information they needed for each split.

Re: distributed search

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> All,
>
> We have two versions of a type of index splitter.  The first version
> would run an indexing job and then, using the completed index as input,
> would read the number of documents in the index and take a requested
> split size.  From this it used a custom index input format to create
> splits according to document id.  We would run a job that would map
> out index urls as keys and, as values, documents with their ids wrapped
> in a SerializableWritable object.  Then inside of a second job using
> the index as input we would have a MapRunner that would read the other
> supporting databases (linkdb, segments) and map all objects as
> ObjectWritables.  Then on the reduce we had a custom Output and
> OutputFormat that took all of the objects and wrote out the databases
> and indexes into each split.
>
> There was a problem with this first approach, though, in that writing
> out an index from a previously serialized document would lose any
> fields that are not stored (which is most of them).  So we went with a
> second approach.

Yes, that's the nature of the beast - I sort of hoped that you had 
implemented a true splitter, which directly splits term posting lists 
according to doc id. This is possible, and doesn't require using stored 
fields - the only problem being that someone well acquainted with Lucene 
internals needs to write it... ;)


>
> The second approach takes a number of splits and runs through an
> indexing job on the fly.  It calls the indexing and scoring filters.
> It uses the linkdb, crawldb, and segments as input.  As it indexes it
> also splits the databases and indexes into the number of reduce tasks,
> so that the final output is multiple splits, each holding a part of the
> index and its supporting databases.  Each of the databases holds only
> the information for the urls that are in its part of the index.  These
> parts can then be pushed to separate search servers.  This type of
> splitting works well but you can NOT define a specific number of
> documents or urls per split, and sometimes one split will have a lot
> more urls than another if you

Why? It depends on the partitioner or the output format that you are 
using. E.g., in SegmentMerger I implemented an OutputFormat that produces 
several segments simultaneously, writing to their respective parts 
depending on a metadata value. This value may be used to switch 
sequentially between output "slices" so that urls are spread evenly 
across the slices.

> are indexing some sites that have a lot of pages (e.g. wikipedia or cnn 
> archives).  This is currently how our system works.  We fetch, invert 
> links, run through some other processes, and then index and split on 
> the fly.  Then we use python scripts to pull each split directly from 
> the DFS to each search server and then start the search servers.
>
> We are still working on the splitter because the ideal approach would 
> be to be able to specify a number of documents per split as well as to 
> group by different keys, not just url.  I would be happy to share the 
> current code but it is highly integrated so I would need to pull it 
> out of our code base first.  It would be best if I could send it to 
> someone, say Andrzej, to take a look at first.

Unless I'm missing something, SegmentMerger.SegmentOutputFormat should 
satisfy these requirements, you would just need to modify the job 
implementation ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: distributed search

Posted by Chad Walters <ch...@powerset.com>.
Dennis,

You should consider splitting based on a function that will give you a more
uniform distribution (e.g. MD5(url)). That way, you should see much less
variation in the number of documents per partition.
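
A minimal sketch of what such a partitioner might look like (illustrative 
only; it is written against the org.apache.hadoop.mapred Partitioner 
interface with generics, which may differ slightly from the interface in 
your Hadoop release, and the key type is assumed to be Text):

import java.security.MessageDigest;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: bucket urls by MD5(url) instead of
// String.hashCode() so documents spread more uniformly across the
// reduce partitions, and therefore across the index splits.
public class Md5UrlPartitioner implements Partitioner<Text, Writable> {

  public void configure(JobConf job) {
    // no configuration needed
  }

  public int getPartition(Text url, Writable value, int numPartitions) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(url.toString().getBytes("UTF-8"));
      // fold the first four digest bytes into a non-negative int
      int hash = ((digest[0] & 0xff) << 24) | ((digest[1] & 0xff) << 16)
               | ((digest[2] & 0xff) << 8)  |  (digest[3] & 0xff);
      return (hash & Integer.MAX_VALUE) % numPartitions;
    } catch (Exception e) {
      throw new RuntimeException("MD5 unavailable", e);
    }
  }
}

It would be plugged in with something like
job.setPartitionerClass(Md5UrlPartitioner.class) on the JobConf.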

Chad

On 12/4/06 3:29 PM, "Dennis Kubes" <nu...@dragonflymc.com> wrote:

> All,
> 
> We have two versions of a type of index splitter.  The first version
> would run an indexing job and then, using the completed index as input,
> would read the number of documents in the index and take a requested
> split size.  From this it used a custom index input format to create
> splits according to document id.  We would run a job that would map out
> index urls as keys and, as values, documents with their ids wrapped in
> a SerializableWritable object.  Then inside of a second job using the
> index as input we would have a MapRunner that would read the other
> supporting databases (linkdb, segments) and map all objects as
> ObjectWritables.  Then on the reduce we had a custom Output and
> OutputFormat that took all of the objects and wrote out the databases
> and indexes into each split.
> 
> There was a problem with this first approach, though, in that writing
> out an index from a previously serialized document would lose any fields
> that are not stored (which is most of them).  So we went with a second
> approach.
> 
> The second approach takes a number of splits and runs through an
> indexing job on the fly.  It calls the indexing and scoring filters.  It
> uses the linkdb, crawldb, and segments as input.  As it indexes it also
> splits the databases and indexes into the number of reduce tasks, so
> that the final output is multiple splits, each holding a part of the
> index and its supporting databases.  Each of the databases holds only
> the information for the urls that are in its part of the index.  These
> parts can then be pushed to separate search servers.  This type of
> splitting works well but you can NOT define a specific number of
> documents or urls per split, and sometimes one split will have a lot
> more urls than another if you are indexing some sites that have a lot
> of pages (e.g. wikipedia or cnn archives).  This is currently how our
> system works.  We fetch, invert links, run through some other
> processes, and then index and split on the fly.  Then we use python
> scripts to pull each split directly from the DFS to each search server
> and then start the search servers.
> 
> We are still working on the splitter because the ideal approach would be
> to be able to specify a number of documents per split as well as to
> group by different keys, not just url.  I would be happy to share the
> current code but it is highly integrated so I would need to pull it out
> of our code base first.  It would be best if I could send it to someone,
> say Andrzej, to take a look at first.
> 
> Dennis
> 
> Andrzej Bialecki wrote:
>> Dennis Kubes wrote:
>> 
>> [...]
>>> Having a new index on each machine and having to create separate
>>> indexes is not the most elegant way to accomplish this architecture.
>>> The best way that we have found is to have a splitter job that
>>> indexes and splits the index and
>> 
>> Have you implemented a Lucene index splitter, i.e. a tool that takes
>> an existing Lucene index and splits it into parts by document id? This
>> sounds very interesting - could you tell us a bit about this?
>> 


Re: distributed search

Posted by Dennis Kubes <nu...@dragonflymc.com>.
All,

We have two versions of a type of index splitter.  The first version 
would run an indexing job and then, using the completed index as input, 
would read the number of documents in the index and take a requested 
split size.  From this it used a custom index input format to create 
splits according to document id.  We would run a job that would map out 
index urls as keys and, as values, documents with their ids wrapped in a 
SerializableWritable object.  Then inside of a second job using the 
index as input we would have a MapRunner that would read the other 
supporting databases (linkdb, segments) and map all objects as 
ObjectWritables.  Then on the reduce we had a custom Output and 
OutputFormat that took all of the objects and wrote out the databases 
and indexes into each split.

There was a problem with this first approach, though, in that writing out 
an index from a previously serialized document would lose any fields 
that are not stored (which is most of them).  So we went with a second 
approach.

The second approach takes a number of splits and runs through an 
indexing job on the fly.  It calls the indexing and scoring filters.  It 
uses the linkdb, crawldb, and segments as input.  As it indexes it also 
splits the databases and indexes into the number of reduce tasks, so that 
the final output is multiple splits, each holding a part of the index and 
its supporting databases.  Each of the databases holds only the 
information for the urls that are in its part of the index.  These parts 
can then be pushed to separate search servers.  This type of splitting 
works well but you can NOT define a specific number of documents or urls 
per split, and sometimes one split will have a lot more urls than another 
if you are indexing some sites that have a lot of pages (e.g. wikipedia 
or cnn archives).  This is currently how our system works.  We fetch, 
invert links, run through some other processes, and then index and split 
on the fly.  Then we use python scripts to pull each split directly from 
the DFS to each search server and then start the search servers.
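
For illustration only (this is not the python scripts mentioned above), 
a minimal Java sketch of how one search server might pull its own split 
out of the DFS, assuming the splits are laid out as crawled/part-00000, 
part-00001, ... and that each server knows its own index number.  The 
paths and the class name are made up; only the Hadoop FileSystem calls 
are real, and their exact signatures may differ between releases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: copy this server's split from the DFS to local
// disk so that "bin/nutch server <port> <localDir>" can serve it.  The
// crawled/part-NNNNN layout and the local path are assumptions.
public class PullSplit {
  public static void main(String[] args) throws Exception {
    int serverIndex = Integer.parseInt(args[0]);   // e.g. 0, 1, 2, ...
    String part = String.format("part-%05d", serverIndex);

    Configuration conf = new Configuration();
    FileSystem dfs = FileSystem.get(conf);         // the configured DFS

    Path src = new Path("crawled/" + part);        // this server's split only
    Path dst = new Path("/d01/local/crawled");     // local search directory
    dfs.copyToLocalFile(src, dst);
  }
}

Copying only this server's part, rather than the whole crawled directory 
to every server, is what avoids the N identical results Giuseppe saw.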

We are still working on the splitter because the ideal approach would be 
to be able to specify a number of documents per split as well as to 
group by different keys, not just url.  I would be happy to share the 
current code but it is highly integrated so I would need to pull it out 
of our code base first.  It would be best if I could send it to someone, 
say Andrzej, to take a look at first.

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>
> [...]
>> Having a new index on each machine and having to create separate 
>> indexes is not the most elegant way to accomplish this architecture.  
>> The best way that we have found is to have a splitter job that 
>> indexes and splits the index and 
>
> Have you implemented a Lucene index splitter, i.e. a tool that takes 
> an existing Lucene index and splits it into parts by document id? This 
> sounds very interesting - could you tell us a bit about this?
>

Re: distributed search

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:

[...]
> Having a new index on each machine and having to create separate 
> indexes is not the most elegant way to accomplish this architecture.  
> The best way that we have found is to have a splitter job that 
> indexes and splits the index and 

Have you implemented a Lucene index splitter, i.e. a tool that takes an 
existing Lucene index and splits it into parts by document id? This 
sounds very interesting - could you tell us a bit about this?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: distributed search

Posted by Dennis Kubes <nu...@dragonflymc.com>.
The distributed searching section assumes that you have split the index 
into multiple pieces and that there is a piece on each machine.  The 
tutorial doesn't tell you how to split the indexes because there is no 
tool to do that yet.  I was trying to lay out a general architecture for 
how to do distributed searching instead of giving a step-by-step method.  
What I would do for now is to create multiple indexes of, say, 2-4 
million pages and put each index on a separate machine.  You would also 
need to copy all of the supporting database files, such as the crawldb 
and linkdb, to each machine.

Having a new index on each machine and having to create separate indexes 
is not the most elegant way to accomplish this architecture.  The best 
way that we have found is to have a splitter job that indexes and 
splits the index and supporting databases into multiple parts on the 
fly.  Then these parts are moved out to the search servers.  We have 
some base code for this but it is not in the nutch codebase yet.  
If you want to move down this path, send me an email.

Dennis

Giuseppe Cannella wrote:
> On the http://wiki.apache.org/nutch/NutchHadoopTutorial page,
>
> in the 'Distributed Searching' section,
> I read:
>
> "On each of the search servers you would start up the distributed search server by using the nutch server command like this:
> bin/nutch server 1234 /d01/local/crawled"
>
> But /d01/local/crawled has been created only for the first server; how can I create it for all servers?
> If I use "bin/hadoop dfs -copyToLocal crawled /d01/local/" on every server, the search finds N identical results (where N is the number of servers in the cluster).