Posted to common-user@hadoop.apache.org by Eugeny N Dzhurinsky <bo...@redwerk.com> on 2007/12/13 10:17:52 UTC

map/reduce and Lucene integration question

Hello!

We would like to use Hadoop to index a large number of documents, and we would
like to have this index in Lucene and use Lucene's search engine for searching.

At this point I am a bit confused. When we analyze documents in the map phase,
we will end up with:
- the document name/location
- a list of name/value pairs to be indexed by Lucene somehow

As far as I know, I can write the same key with different values to the
OutputCollector, but I'm not sure how to pass a list of name/value pairs to the
collector. Or do I need to think about this in a different way?
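
For example, I imagine the map part could look roughly like the sketch below.
This is just my own sketch against the old org.apache.hadoop.mapred API,
assuming it is reasonable to pack the fields of one document into a
MapWritable; the field names are placeholders.

// Sketch only: emit one record per document, with the document name as the key
// and its indexable fields packed into a MapWritable as the value.
import java.io.IOException;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DocumentMapper extends MapReduceBase implements Mapper {

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    // key = document name/location, value = raw document contents
    MapWritable fields = new MapWritable();
    fields.put(new Text("title"), new Text("..."));            // placeholder field
    fields.put(new Text("body"),  new Text(value.toString())); // placeholder field
    // one output record per document: name -> all of its name/value pairs
    output.collect(key, fields);
  }
}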

Another question is how I can write a Lucene index in the reduce phase, since as
far as I know a reduce task can be invoked on any machine in the cluster, while
Lucene requires a non-DFS filesystem to store its index and helper files.

I have heard that Nutch can use Map/Reduce to index documents and store them in
a Lucene index, but a quick look at its sources didn't give me a solid view of
how it does this, or whether it does it the way I described at all.

I'm probably missing something, so could somebody please point me in the right
direction?

Thank you in advance!

-- 
Eugene N Dzhurinsky

Re: map/reduce and Lucene integration question

Posted by Ted Dunning <td...@veoh.com>.

Yes.



On 12/13/07 12:22 PM, "Eugeny N Dzhurinsky" <bo...@redwerk.com> wrote:

> On Thu, Dec 13, 2007 at 11:31:49AM -0800, Ted Dunning wrote:
>> After indexing, indexes are moved to multiple query servers.  ... (how nutch
>> works) With this architecture, you get good scaling in both queries per
>> second and collection size and you maintain full HA.
> 
> Would it be correct to assume that the things you described above are outside
> the scope of Hadoop?


Re: map/reduce and Lucene integration question

Posted by Eugeny N Dzhurinsky <bo...@redwerk.com>.
On Thu, Dec 13, 2007 at 11:31:49AM -0800, Ted Dunning wrote:
> After indexing, indexes are moved to multiple query servers.  The indexes on
> the local query servers are all on local disk.
> 
> There are two dimensions to scaling search.  The first dimension is query
> rate.  To get that scaling, you simply replicate your basic search operator
> and balance using a simple load balancer.
> 
> The second dimension is collection size.  If you have more than about 20
> million documents, you need to have several machines cooperate in a search.
> To scale in this dimension you have front end engines that do multi-searches
> against farms that each scale in the first dimension using load balancing.
> You need load balancing in front of your front end engines as well.
> 
> With this architecture, you get good scaling in both queries per second and
> collection size and you maintain full HA.

Would it be correct to assume that the things you described above are outside
the scope of Hadoop?

-- 
Eugene N Dzhurinsky

Re: map/reduce and Lucene integration question

Posted by Ted Dunning <td...@veoh.com>.

After indexing, indexes are moved to multiple query servers.  The indexes on
the local query servers are all on local disk.

There are two dimensions to scaling search.  The first dimension is query
rate.  To get that scaling, you simply replicate your basic search operator
and balance using a simple load balancer.

The second dimension is collection size.  If you have more than about 20
million documents, you need to have several machines cooperate in a search.
To scale in this dimension you have front end engines that do multi-searches
against farms that each scale in the first dimension using load balancing.
You need load balancing in front of your front end engines as well.

With this architecture, you get good scaling in both queries per second and
collection size and you maintain full HA.
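
To make the multi-search part concrete, the front end can be pictured as doing
something like this (purely an illustration, not Hadoop-specific; the hostnames
are made up, and each farm would expose a Lucene RemoteSearchable over RMI
behind its own load balancer):

// Illustrative front-end multi-search: each "farm" endpoint is a load-balanced
// group of query servers holding one slice of the collection.
import java.rmi.Naming;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class FrontEndSearch {
  public static void main(String[] args) throws Exception {
    // Each lookup resolves to one farm that serves a slice of the index.
    Searchable farmA = (Searchable) Naming.lookup("//search-farm-a/searchable");
    Searchable farmB = (Searchable) Naming.lookup("//search-farm-b/searchable");

    // MultiSearcher fans the query out to every slice and merges the results.
    MultiSearcher searcher = new MultiSearcher(new Searchable[] { farmA, farmB });
    Query query = new QueryParser("body", new StandardAnalyzer()).parse("hadoop lucene");
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " matching documents");
    searcher.close();
  }
}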



On 12/13/07 11:18 AM, "Eugeny N Dzhurinsky" <bo...@redwerk.com> wrote:

> On Thu, Dec 13, 2007 at 11:03:50AM -0800, Ted Dunning wrote:
>> 
>> I don't think so (but I don't run nutch)
>> 
>> To actually run searches, the search engines copy the index to local
>> storage.  Having them in HDFS is very nice, however, as a way to move them
>> to the right place.
> 
> Even with an extremely fast network connection between nodes, moving indexes
> several gigabytes in size seems very slow.
> 
> Is there any way to guarantee that a request is sent to the particular data
> node that already holds the required part of the index, or to guarantee that
> all reduce jobs run on the same host, so that the index ends up located on
> that host?
> 
> I feel like map/reduce is a perfect way to index a large set of documents, but
> I'm not sure how searching would be performed later. I can imagine the search
> request being broadcast to ALL nodes, with each node taking the request,
> performing the search and returning results (or not) to be reduced later.
> However, as far as I can see, Hadoop will send the request to the first node
> that appears to be free - not necessarily the node that holds the index
> relevant to this request.


Re: map/reduce and Lucene integration question

Posted by Eugeny N Dzhurinsky <bo...@redwerk.com>.
On Thu, Dec 13, 2007 at 11:03:50AM -0800, Ted Dunning wrote:
> 
> I don't think so (but I don't run nutch)
> 
> To actually run searches, the search engines copy the index to local
> storage.  Having them in HDFS is very nice, however, as a way to move them
> to the right place.

Even with an extremely fast network connection between nodes, moving indexes
several gigabytes in size seems very slow.

Is there any way to guarantee that a request is sent to the particular data node
that already holds the required part of the index, or to guarantee that all
reduce jobs run on the same host, so that the index ends up located on that
host?

I feel like map/reduce is a perfect way to index a large set of documents, but
I'm not sure how searching would be performed later. I can imagine the search
request being broadcast to ALL nodes, with each node taking the request,
performing the search and returning results (or not) to be reduced later.
However, as far as I can see, Hadoop will send the request to the first node
that appears to be free - not necessarily the node that holds the index relevant
to this request.

-- 
Eugene N Dzhurinsky

Re: map/reduce and Lucene integration question

Posted by Andrzej Bialecki <ab...@getopt.org>.
Ted Dunning wrote:
> I don't think so (but I don't run nutch)
> 
> To actually run searches, the search engines copy the index to local
> storage.  Having them in HDFS is very nice, however, as a way to move them
> to the right place.

Nutch can search in Lucene indexes on HDFS (see 
org.apache.nutch.indexer.FsDirectory). However, the performance hit is 
substantial, so the best practice is to copy indexes to a local FS.
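
For example, opening such an index would look roughly like this (a quick sketch
from memory; double-check the FsDirectory constructor arguments against the
Nutch version you are using):

// Sketch: open a Lucene index that lives on HDFS through Nutch's FsDirectory.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.search.IndexSearcher;
import org.apache.nutch.indexer.FsDirectory;

public class DfsSearchExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // An index partition previously written to HDFS, e.g. by the Nutch Indexer job.
    Path indexDir = new Path("/user/indexes/part-00000");

    // false = open an existing index rather than creating a new one.
    FsDirectory directory = new FsDirectory(fs, indexDir, false, conf);
    IndexSearcher searcher = new IndexSearcher(directory);

    // ... run queries here; expect noticeably slower reads than with an index
    // copied to the local filesystem.
    searcher.close();
  }
}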


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: map/reduce and Lucene integration question

Posted by Ted Dunning <td...@veoh.com>.
I don't think so (but I don't run nutch)

To actually run searches, the search engines copy the index to local
storage.  Having them in HDFS is very nice, however, as a way to move them
to the right place.


On 12/13/07 10:59 AM, "Eugeny N Dzhurinsky" <bo...@redwerk.com> wrote:

> On Thu, Dec 13, 2007 at 11:36:31AM +0200, Enis Soztutar wrote:
>> Hi,
>> 
>> Nutch indexes the documents in the org.apache.nutch.indexer.Indexer class.
>> In the reduce phase, the documents are output wrapped in ObjectWritable.
>> The OutputFormat opens a local index writer (FileSystem.startLocalOutput())
>> and adds all the documents that are collected, then puts the index into
>> DFS (FileSystem.completeLocalOutput()). The resulting index has numReducer
>> partitions.
>> 
> 
> Does this mean that Lucene can work with indexes on DFS, or that Nutch doesn't use Lucene?


Re: map/reduce and Lucene integration question

Posted by Eugeny N Dzhurinsky <bo...@redwerk.com>.
On Thu, Dec 13, 2007 at 11:36:31AM +0200, Enis Soztutar wrote:
> Hi,
> 
> Nutch indexes the documents in the org.apache.nutch.indexer.Indexer class.
> In the reduce phase, the documents are output wrapped in ObjectWritable.
> The OutputFormat opens a local index writer (FileSystem.startLocalOutput())
> and adds all the documents that are collected, then puts the index into
> DFS (FileSystem.completeLocalOutput()). The resulting index has numReducer
> partitions.
> 

Does this mean that Lucene can work with indexes on DFS, or that Nutch doesn't use Lucene?

-- 
Eugene N Dzhurinsky

RE: map/reduce and Lucene integration question

Posted by "Butler, Mark (Labs)" <ma...@hp.com>.
Hi team,

First off, I would like to express that I am very impressed with Hadoop and very grateful to everyone who has contributed to it and provided this software open source.

re: Lucene and Hadoop

I am in the process of implementing a Lucene distributed index (DLucene), based on the design that Doug Cutting outlines here

http://www.mail-archive.com/general@lucene.apache.org/msg00338.html

Current status: I've completed the master / worker implementations. I haven't yet implemented a throttling and garbage collection policy. I've also written unit tests. The next step is to write a system test and the client API. Also - and unfortunately this could take a little time - I need to get permission to release the code open source from a review board here at HP. This is in process, but with the lawyers (sigh). DLucene is not big; the code and unit tests are currently about 4000 lines.

DLucene does not use HDFS, but its design is inspired by HDFS. I decided not to use HDFS because it is optimized for a certain type of file, and the files in Lucene are a bit different. However, I've tried to reuse code wherever I can.

There is no explicit integration with MapReduce at the moment. I wasn't aware of the way Nutch uses it; obviously it would be good to support Nutch here.

I've made some small changes to the API Doug outlined; if others are interested, I can post the revised interfaces. It would also be good to start a discussion about the client API, and perhaps about how it could be used with MapReduce?

kind regards,

Mark

-----Original Message-----
From: Enis Soztutar [mailto:enis.soz.nutch@gmail.com]
Sent: 13 December 2007 09:37
To: hadoop-user@lucene.apache.org
Subject: Re: map/reduce and Lucene integration question

Hi,

Nutch indexes the documents in the org.apache.nutch.indexer.Indexer class. In the reduce phase, the documents are output wrapped in ObjectWritable. The OutputFormat opens a local index writer (FileSystem.startLocalOutput()) and adds all the documents that are collected, then puts the index into DFS (FileSystem.completeLocalOutput()). The resulting index has numReducer partitions.

Eugeny N Dzhurinsky wrote:
> Hello!
>
> We would like to use Hadoop to index a large number of documents, and we
> would like to have this index in Lucene and use Lucene's search engine for
> searching.
>
> At this point I am a bit confused. When we analyze documents in the map
> phase, we will end up with:
> - the document name/location
> - a list of name/value pairs to be indexed by Lucene somehow
>
> As far as I know, I can write the same key with different values to the
> OutputCollector, but I'm not sure how to pass a list of name/value pairs to
> the collector. Or do I need to think about this in a different way?
>
> Another question is how I can write a Lucene index in the reduce phase,
> since as far as I know a reduce task can be invoked on any machine in the
> cluster, while Lucene requires a non-DFS filesystem to store its index and
> helper files.
>
> I have heard that Nutch can use Map/Reduce to index documents and store
> them in a Lucene index, but a quick look at its sources didn't give me a
> solid view of how it does this, or whether it does it the way I described
> at all.
>
> I'm probably missing something, so could somebody please point me in the
> right direction?
>
> Thank you in advance!
>
>

Re: map/reduce and Lucene integration question

Posted by Enis Soztutar <en...@gmail.com>.
Hi,

Nutch indexes the documents in the org.apache.nutch.indexer.Indexer
class. In the reduce phase, the documents are output wrapped in
ObjectWritable. The OutputFormat opens a local index writer
(FileSystem.startLocalOutput()) and adds all the documents that are
collected, then puts the index into DFS (FileSystem.completeLocalOutput()).
The resulting index has numReducer partitions.
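
The shape of that output path is roughly as follows (a simplified sketch of the
idea only, not the actual Nutch code; the partition name, analyzer and temporary
path are placeholders):

// Simplified sketch of the idea behind Nutch's index OutputFormat: build the
// Lucene index on local disk, then hand the finished directory back to DFS.
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class LocalIndexSketch {

  // Called once per reduce partition; "part" would normally come from the task id.
  public static void writePartition(FileSystem fs, Path dfsOutputDir, String part,
                                    Iterable<Document> docs) throws IOException {
    Path dfsIndex = new Path(dfsOutputDir, part);
    Path localTmp = new Path("/tmp/index-" + part);

    // startLocalOutput() hands back a local path to write to; Lucene is happy there.
    Path local = fs.startLocalOutput(dfsIndex, localTmp);
    IndexWriter writer = new IndexWriter(local.toString(), new StandardAnalyzer(), true);
    for (Document doc : docs) {
      writer.addDocument(doc);
    }
    writer.optimize();
    writer.close();

    // completeLocalOutput() copies the finished local index into DFS.
    fs.completeLocalOutput(dfsIndex, localTmp);
  }
}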

Eugeny N Dzhurinsky wrote:
> Hello!
>
> We would like to use Hadoop to index a large number of documents, and we would
> like to have this index in Lucene and use Lucene's search engine for searching.
>
> At this point I am a bit confused. When we analyze documents in the map phase,
> we will end up with:
> - the document name/location
> - a list of name/value pairs to be indexed by Lucene somehow
>
> As far as I know, I can write the same key with different values to the
> OutputCollector, but I'm not sure how to pass a list of name/value pairs to the
> collector. Or do I need to think about this in a different way?
>
> Another question is how I can write a Lucene index in the reduce phase, since as
> far as I know a reduce task can be invoked on any machine in the cluster, while
> Lucene requires a non-DFS filesystem to store its index and helper files.
>
> I have heard that Nutch can use Map/Reduce to index documents and store them in
> a Lucene index, but a quick look at its sources didn't give me a solid view of
> how it does this, or whether it does it the way I described at all.
>
> I'm probably missing something, so could somebody please point me in the right
> direction?
>
> Thank you in advance!
>
>