Posted to general@lucene.apache.org by Stefan Groschupf <sg...@101tec.com> on 2008/04/04 04:44:34 UTC

Re: Lucene-based Distributed Index Leveraging Hadoop

Hi All,

We are also very much interested in such a system and actually have to
realize such a system for a project within the next three months.
I would prefer to work on an open-source solution instead of doing
another one behind closed doors, though we would need to start coding
pretty soon. We have three full-time developers we could contribute for
this time to such a project.

I'm happy to do all the organisational work, like setting up the
complete infrastructure etc., to get it started.
I suggest we start with a SourceForge project since this is fast to set
up, and if we qualify for Apache as a Lucene or Hadoop subproject we can
migrate there later - or is it easy to start an Apache Incubator project?

We might just need a nice name for the project. Doug, any idea? :-)

Should we start from scratch or with a code contribution?
Does someone still want to contribute their implementation?


Thanks.
Stefan

On Feb 6, 2008, at 10:57 AM, Ning Li wrote:
> There have been several proposals for a Lucene-based distributed index
> architecture.
> 1) Doug Cutting's "Index Server Project Proposal" at
>    http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
> 2) Solr's "Distributed Search" at
>    http://wiki.apache.org/solr/DistributedSearch
> 3) Mark Butler's "Distributed Lucene" at
>    http://wiki.apache.org/hadoop/DistributedLucene
>
> We have also been working on a Lucene-based distributed index
> architecture. Our design differs from the above proposals in the way it
> leverages Hadoop as much as possible. In particular, HDFS is used to
> reliably store Lucene instances, Map/Reduce is used to analyze documents
> and update Lucene instances in parallel, and Hadoop's IPC framework is
> used. Our design is geared for applications that require a highly
> scalable index and where batch updates to each Lucene instance are
> acceptable (versus finer-grained document-at-a-time updates).
>
> We have a working implementation of our design and are in the process of
> evaluating its performance. An overview of our design is provided below.
> We welcome feedback and would like to know if you are interested in
> working on it. If so, we would be happy to make the code publicly
> available. At the same time, we would like to collaborate with people
> working on existing proposals and see if we can consolidate our efforts.
>
> TERMINOLOGY
> A distributed "index" is partitioned into "shards". Each shard
> corresponds to a Lucene instance and contains a disjoint subset of the
> documents in the index. Each shard is stored in HDFS and served by one or
> more "shard servers". Here we only talk about a single distributed index,
> but in practice multiple indexes can be supported.
>
> A "master" keeps track of the shard servers and the shards being served
> by them. An "application" updates and queries the global index through an
> "index client". An index client communicates with the shard servers to
> execute a query.
>
> KEY RPC METHODS
> This section lists the key RPC methods in our design. To simplify the
> discussion, some of their parameters have been omitted.
>
>  On the Shard Servers
>    // Execute a query on this shard server's Lucene instance.
>    // This method is called by an index client.
>    SearchResults search(Query query);
>
>  On the Master
>    // Tell the master to update the shards, i.e., Lucene instances.
>    // This method is called by an index client.
>    boolean updateShards(Configuration conf);
>
>    // Ask the master where the shards are located.
>    // This method is called by an index client.
>    LocatedShards getShardLocations();
>
>    // Send a heartbeat to the master. This method is called by a
>    // shard server. In the response, the master informs the
>    // shard server when to switch to a newer version of the index.
>    ShardServerCommand sendHeartbeat();
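>
> As a minimal sketch (names simplified, not the actual code; the
> getProtocolVersion() method required by VersionedProtocol is omitted),
> these could be declared as Hadoop IPC protocol interfaces like this:
>
>    public interface ShardServerProtocol extends VersionedProtocol {
>      // Execute a query on this shard server's Lucene instance.
>      SearchResults search(Query query) throws IOException;
>    }
>
>    public interface MasterProtocol extends VersionedProtocol {
>      boolean updateShards(Configuration conf) throws IOException;
>      LocatedShards getShardLocations() throws IOException;
>      ShardServerCommand sendHeartbeat() throws IOException;
>    }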
>
> QUERYING THE INDEX
> To query the index, an application sends a search request to an index
> client. The index client then calls the shard server search() method for
> each shard of the index, merges the results and returns them to the
> application. The index client caches the mapping between shards and shard
> servers by periodically calling the master's getShardLocations() method.
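>
> As a sketch of the client-side flow (helper names hypothetical), the
> fan-out and merge could look like this:
>
>    // Query every shard in parallel and merge the results.
>    ExecutorService pool = Executors.newFixedThreadPool(shards.size());
>    List<Future<SearchResults>> futures =
>        new ArrayList<Future<SearchResults>>();
>    for (final Shard shard : shards) {
>      futures.add(pool.submit(new Callable<SearchResults>() {
>        public SearchResults call() throws Exception {
>          // serverFor() consults the cached shard-to-server mapping
>          // obtained via getShardLocations().
>          return serverFor(shard).search(query);
>        }
>      }));
>    }
>    SearchResults merged = new SearchResults();
>    for (Future<SearchResults> f : futures) {
>      merged.add(f.get());  // merge results, e.g., by score
>    }
>    pool.shutdown();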
>
> UPDATING THE INDEX USING MAP/REDUCE
> To update the index, an application sends an update request to an index
> client. The index client then calls the master's updateShards() method,
> which schedules a Map/Reduce job to update the index. The Map/Reduce job
> updates the shards in parallel and copies the new index files of each
> shard (i.e., Lucene instance) to HDFS.
>
> The updateShards() method includes a "configuration", which provides
> information for updating the shards. More specifically, the configuration
> includes the following information:
>  - Input path. This provides the location of updated documents, e.g.,
>    HDFS files or directories, or HBase tables.
>  - Input formatter. This specifies how to format the input documents.
>  - Analysis. This defines the analyzer to use on the input. The analyzer
>    determines whether a document is being inserted, updated, or deleted.
>    For inserts or updates, the analyzer also converts each input document
>    into a Lucene document.
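>
> As a small illustration (the property and class names here are all
> hypothetical), such a configuration might be expressed with Hadoop's
> Configuration class:
>
>    Configuration conf = new Configuration();
>    // Where the updated documents live (HDFS files/dirs or HBase tables).
>    conf.set("index.update.input.path", "/user/app/docs/batch-0042");
>    // How to parse the input into documents.
>    conf.set("index.update.input.format", DocInputFormat.class.getName());
>    // Analyzer that classifies each doc as insert/update/delete and, for
>    // inserts/updates, produces the Lucene document.
>    conf.set("index.update.analyzer", DocAnalyzer.class.getName());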
>
> The Map phase of the Map/Reduce job formats and analyzes the input (in
> parallel), while the Reduce phase collects and applies the updates to
> each Lucene instance (again in parallel). The updates are applied using
> the local file system where a Reduce task runs and then copied back to
> HDFS. For example, if the updates caused a new Lucene segment to be
> created, the new segment would be created on the local file system first,
> and then copied back to HDFS.
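>
> A rough sketch of the job (types and helpers are hypothetical, using
> Hadoop's mapred API):
>
>    public static class AnalysisMapper extends MapReduceBase
>        implements Mapper<LongWritable, Text, Shard, DocumentOp> {
>      public void map(LongWritable key, Text value,
>                      OutputCollector<Shard, DocumentOp> out, Reporter rep)
>          throws IOException {
>        // The analyzer decides insert/update/delete and builds the
>        // Lucene document for inserts and updates.
>        DocumentOp op = analyzer.analyze(value);
>        out.collect(shardFor(op), op);  // route the op to its shard
>      }
>    }
>
>    public static class ShardUpdateReducer extends MapReduceBase
>        implements Reducer<Shard, DocumentOp, Shard, Text> {
>      public void reduce(Shard shard, Iterator<DocumentOp> ops,
>                         OutputCollector<Shard, Text> out, Reporter rep)
>          throws IOException {
>        // Apply updates on the task's local file system, then copy the
>        // new files (e.g., new segments) back to HDFS.
>        IndexWriter writer = openLocalCopy(shard);
>        while (ops.hasNext()) {
>          apply(writer, ops.next());
>        }
>        writer.close();
>        copyNewFilesToHdfs(shard);
>      }
>    }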
>
> When the Map/Reduce job completes, a "new version" of the index is ready
> to be queried. It is important to note that the new version of the index
> is not derived from scratch. By leveraging Lucene's update algorithm, the
> new version of each Lucene instance will share as many files as possible
> with the previous version.
>
> ENSURING INDEX CONSISTENCY
> At any point in time, an index client always has a consistent view of the
> shards in the index. The results of a search query include either all or
> none of a recent update to the index. The details of the algorithm to
> accomplish this are omitted here, but the basic flow is pretty simple.
>
> After the Map/Reduce job to update the shards completes, the master will
> tell each shard server to "prepare" the new version of the index. After
> all the shard servers have responded affirmatively to the "prepare"
> message, the new index is ready to be queried. An index client will then
> lazily learn about the new index when it makes its next
> getShardLocations() call to the master.
>
> In essence, a lazy two-phase commit protocol is used, with "prepare" and
> "commit" messages piggybacked on heartbeats. After a shard has switched
> to the new index, the Lucene files in the old index that are no longer
> needed can safely be deleted.
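>
> A minimal sketch of the shard server side (command names hypothetical):
>
>    // Heartbeat loop: "prepare" and "commit" arrive as piggybacked
>    // commands in the master's heartbeat responses.
>    while (running) {
>      ShardServerCommand cmd = master.sendHeartbeat();
>      switch (cmd.getAction()) {
>        case PREPARE:
>          // Open the new version alongside the one being served.
>          openVersion(cmd.getVersion());
>          break;
>        case COMMIT:
>          // Atomically switch searches to the new version, then drop
>          // old-version files that are no longer shared.
>          switchToVersion(cmd.getVersion());
>          deleteUnneededOldFiles();
>          break;
>        default:
>          break;
>      }
>      try { Thread.sleep(HEARTBEAT_INTERVAL); }
>      catch (InterruptedException e) { break; }
>    }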
>
> ACHIEVING FAULT-TOLERANCE
> We rely on the fault-tolerance of Map/Reduce to guarantee that an index
> update will eventually succeed. All shards are stored in HDFS and can be
> read by any shard server in a cluster. For a given shard, if one of its
> shard servers dies, new search requests are handled by its surviving
> shard servers. To ensure that there is always enough coverage for a
> shard, the master will instruct other shard servers to take over the
> shards of a dead shard server.
>
> PERFORMANCE ISSUES
> Currently, each shard server reads a shard directly from HDFS.
> Experiments have shown that this approach does not perform very well,
> with HDFS causing Lucene to slow down fairly dramatically (by well over
> 5x when data blocks are accessed over the network). Consequently, we are
> exploring different ways to leverage the fault tolerance of HDFS and, at
> the same time, work around its performance problems. One simple
> alternative is to add a local file system cache on each shard server.
> Another alternative is to modify HDFS so that an application has more
> control over where to store the primary and replicas of an HDFS block.
> This feature may be useful for other HDFS applications (e.g., HBase). We
> would like to collaborate with other people who are interested in adding
> this feature to HDFS.
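>
> As an illustration of the first alternative (helper and path names
> hypothetical, not our actual implementation), a shard server could copy
> a shard's files out of HDFS to local disk before opening it with Lucene:
>
>    // Cache a shard's Lucene files locally so searches avoid reading
>    // data blocks over the network.
>    FileSystem hdfs = FileSystem.get(conf);
>    Path shardPath = new Path("/index/shards/" + shardName);
>    File localDir = new File(localCacheRoot, shardName);
>    localDir.mkdirs();
>    for (FileStatus status : hdfs.listStatus(shardPath)) {
>      File local = new File(localDir, status.getPath().getName());
>      if (!local.exists()) {  // fetch only files not yet cached
>        hdfs.copyToLocalFile(status.getPath(),
>                             new Path(local.getAbsolutePath()));
>      }
>    }
>    IndexSearcher searcher =
>        new IndexSearcher(FSDirectory.getDirectory(localDir));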
>
>
> Regards,
> Ning Li

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com



RE: Lucene-based Distributed Index Leveraging Hadoop

Posted by marcus clemens <ma...@hotmail.com>.
Message to Mark Butler: I am looking for candidates that have Lucene
experience, for contract and permanent positions. Can you please send me
your CV?

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Richard Marr <ri...@gmail.com>.
2008/8/21 Stefan Groschupf <sg...@101tec.com>:
>
> Is there any material published about this? I would be very interested
> to see Mark's slides and hear about the discussion.
>

In case anybody wants to see Mark's talk, the slides and video are here:
http://skillsmatter.com/podcast/home/distributed-lucene-for-hadoop

Rich

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Richard Marr <ri...@gmail.com>.
Stefan,

I've got a lot of reading and learning to do  :o)

Thanks for the info, and good luck with your deployment.

Rich

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Stefan Groschupf <sg...@101tec.com>.
Hi,

> In terms of which project best fits my needs my gut feeling is that
> dlucene is pretty close. It supports incremental updates, and doesn't
> build in dependencies on systems like HDFS or Terracotta (I don't yet
> understand all the implications of those systems so would rather keep
> things simple if possible).

Upgrades...
The way we solve this with katta is that we simply deploy a new small
index and use * in the client instead of a fixed index name.
Then once a night we merge all the small indexes together into one big
new index (since having many small indexes slows things down).
To solve the problem of duplicate documents, each document gets a
timestamp, and in the client we do a simple dedup based on a key,
always using the document with the latest timestamp.
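
A rough sketch of that client-side dedup (class and method names
hypothetical, not the actual katta code):

   // For each document key, keep only the hit with the latest timestamp.
   Map<String, Hit> latest = new HashMap<String, Hit>();
   for (Hit hit : hits) {
     Hit seen = latest.get(hit.getKey());
     if (seen == null || hit.getTimestamp() > seen.getTimestamp()) {
       latest.put(hit.getKey(), hit);
     }
   }
   Collection<Hit> deduped = latest.values();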

Dependencies...
Katta is independent of those technologies; it is Lucene, ZooKeeper and
Hadoop RPC (instead of RMI, HTTP or Apache Mina). We support loading
index shards from a Hadoop file system, but you can also load them from
a mounted remote HDD, a NAS, or whatever you like.

> The obvious drawback being that dlucene
> doesn't seem to be an active public project.
Mark needs to answer this, but dlucene is checked in to the katta svn
and I saw Marko checking in changes to dlucene. There was a discussion
between Mark and me about bringing dlucene and katta together, and I
would really love to see that happen, but unfortunately we had a lot of
pressure from our customer to deliver something, so we had to focus on
other things. More developers getting involved would clearly help
here.. :-)

>
>
> Thanks for the reply Stefan. I'll certainly be taking a look through
> the code for Katta since no doubt there's a lot to learn in there.
Katta will be deployed into a production system of our customer in less
than 4 weeks - so we are working hard to iron out issues.
However, katta has been running for six weeks in a 10-node test
environment under heavy load.

Stefan 

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Richard Marr <ri...@gmail.com>.
Stefan,

> Is there any material published about this? I would be very interested
> to see Mark's slides and hear about the discussion.

I believe all the slides will be available. I'll post a link as soon
as I have one.

> Please keep in mind that katta is very young and compass or solr might be
> more interesting if you need something working now, though they might have
> different goals and focus than dlucene or katta.

I am looking to have something working relatively quickly, but my
performance needs and use cases are relatively modest (for now) so
some degree of string and sticky tape in the implementation is okay in
the short term. My main aim is to ensure that whatever I implement
scales horizontally without too much drama.

In terms of which project best fits my needs my gut feeling is that
dlucene is pretty close. It supports incremental updates, and doesn't
build in dependencies on systems like HDFS or Terracotta (I don't yet
understand all the implications of those systems so would rather keep
things simple if possible). The obvious drawback being that dlucene
doesn't seem to be an active public project.

Thanks for the reply Stefan. I'll certainly be taking a look through
the code for Katta since no doubt there's a lot to learn in there.

All the best...

Rich

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Stefan Groschupf <sg...@101tec.com>.
Hi All, Hi Mark,

> It was interesting to hear Mark Butler present his implementation of
> Distributed Lucene at the Hadoop User Group meeting in London on
> Tuesday. There's obviously been quite a bit of discussion on the
> subject, and lots of interested parties. Mark, not sure if you're on
> this list but thanks for sharing.

Is there any material published about this? I would be very interested
to see Mark's slides and hear about the discussion.


> Is this the forum to ask about open projects? I'm interested in
> joining a project as long as its goals aren't too distant from what I'm
> looking for. Based mostly on gut feeling I'd rather go for a
> stand-alone project that wasn't dependent on HDFS/Hadoop, but willing
> to be convinced otherwise.

Rich, as you know there are a couple of projects in this area - Solr,
Compass, dlucene and katta - and since all are open source I guess the
easiest way to be involved is to join the mailing lists.

I can only speak for katta - we are very interested in getting more
people involved to get other perspectives. There is quite some activity
in our project since it is part of an upcoming production system, but
low traffic on the mailing list (so far all developers work in the same
room).

You can find our mailing list on our source forge page:
http://katta.wiki.sourceforge.net/

Please keep in mind that katta is very young and compass or solr might  
be more interesting if you need something working now, though they  
might have different goals and focus than dlucene or katta.

Stefan Groschupf

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Richard Marr <ri...@gmail.com>.
> Stefan Groschupf (4 Apr) wrote:
>
> I just noticed - too late though - Ning already contributed the code to
> Hadoop. So I guess my question should be rephrased: what is the idea of
> moving this into its own project?


Hi all,

It was interesting to hear Mark Butler present his implementation of
Distributed Lucene at the Hadoop User Group meeting in London on
Tuesday. There's obviously been quite a bit of discussion on the
subject, and lots of interested parties. Mark, not sure if you're on
this list but thanks for sharing.

Is this the forum to ask about open projects? I'm interested in
joining a project as long as its goals aren't too distant from what I'm
looking for. Based mostly on gut feeling I'd rather go for a
stand-alone project that wasn't dependent on HDFS/Hadoop, but willing
to be convinced otherwise.

Rich

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Stefan Groschupf <sg...@101tec.com>.
> Should we start from scratch or with a code contribution?
> Does someone still want to contribute their implementation?
I just noticed - too late though - Ning already contributed the code to
Hadoop. So I guess my question should be rephrased: what is the idea of
moving this into its own project?