Posted to general@lucene.apache.org by Ning Li <ni...@gmail.com> on 2008/02/06 19:57:22 UTC

Lucene-based Distributed Index Leveraging Hadoop

There have been several proposals for a Lucene-based distributed index
architecture.
 1) Doug Cutting's "Index Server Project Proposal" at
    http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
 2) Solr's "Distributed Search" at
    http://wiki.apache.org/solr/DistributedSearch
 3) Mark Butler's "Distributed Lucene" at
    http://wiki.apache.org/hadoop/DistributedLucene

We have also been working on a Lucene-based distributed index architecture.
Our design differs from the above proposals in that it leverages Hadoop as
much as possible. In particular, HDFS is used to reliably store Lucene
instances, Map/Reduce is used to analyze documents and update Lucene
instances in parallel, and Hadoop's IPC framework is used. Our design is
geared for applications that require a highly scalable index and where
batch updates to each Lucene instance are acceptable (versus finer-grained
document-at-a-time updates).

We have a working implementation of our design and are in the process
of evaluating its performance. An overview of our design is provided below.
We welcome feedback and would like to know if you are interested in working
on it. If so, we would be happy to make the code publicly available. At the
same time, we would like to collaborate with people working on existing
proposals and see if we can consolidate our efforts.

TERMINOLOGY
A distributed "index" is partitioned into "shards". Each shard corresponds
to a Lucene instance and contains a disjoint subset of the documents in the
index. Each shard is stored in HDFS and served by one or more "shard
servers". Here we only talk about a single distributed index, but in
practice multiple indexes can be supported.

A "master" keeps track of the shard servers and the shards being served by
them. An "application" updates and queries the global index through an
"index client". An index client communicates with the shard servers to
execute a query.

KEY RPC METHODS
This section lists the key RPC methods in our design. To simplify the
discussion, some of their parameters have been omitted.

  On the Shard Servers
    // Execute a query on this shard server's Lucene instance.
    // This method is called by an index client.
    SearchResults search(Query query);

  On the Master
    // Tell the master to update the shards, i.e., Lucene instances.
    // This method is called by an index client.
    boolean updateShards(Configuration conf);

    // Ask the master where the shards are located.
    // This method is called by an index client.
    LocatedShards getShardLocations();

    // Send a heartbeat to the master. This method is called by a
    // shard server. In the response, the master informs the
    // shard server when to switch to a newer version of the index.
    ShardServerCommand sendHeartbeat();

QUERYING THE INDEX
To query the index, an application sends a search request to an index
client. The index client then calls the shard server search() method for
each shard of the index, merges the results and returns them to the
application. The index client caches the mapping between shards and shard
servers by periodically calling the master's getShardLocations() method.
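
For illustration, the index client's fan-out might look roughly like the
sketch below. This is a hedged sketch, not our actual code: the Shard,
LocatedShards and merge() helpers are assumed names; only search() and
getShardLocations() come from the RPC list above.

  // Sketch: query every shard of the index and merge the partial results.
  public SearchResults search(Query query) throws IOException {
    LocatedShards located = cachedLocations;  // refreshed via getShardLocations()
    List<SearchResults> partials = new ArrayList<SearchResults>();
    for (Shard shard : located.getShards()) {
      ShardServer server = located.pickServerFor(shard); // any server serving it
      partials.add(server.search(query));                // shard-server RPC above
    }
    return SearchResults.merge(partials);                // merge by score
  }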

UPDATING THE INDEX USING MAP/REDUCE
To update the index, an application sends an update request to an index
client. The index client then calls the master's updateShards() method,
which schedules a Map/Reduce job to update the index. The Map/Reduce job
updates the shards in parallel and copies the new index files of each
shard (i.e., Lucene instance) to HDFS.

The updateShards() method includes a "configuration", which provides
information for updating the shards. More specifically, the configuration
includes the following information (a usage sketch follows the list):
  - Input path. This provides the location of updated documents, e.g., HDFS
    files or directories, or HBase tables.
  - Input formatter. This specifies how to format the input documents.
  - Analysis. This defines the analyzer to use on the input. The analyzer
    determines whether a document is being inserted, updated, or deleted.
    For inserts or updates, the analyzer also converts each input document
    into a Lucene document.
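
As a hedged sketch of what driving an update might look like from the
application side - the property keys and the two com.example classes are
purely illustrative assumptions; only updateShards() and Hadoop's
Configuration class are given above:

  // Sketch: describe the input, formatter and analysis, then ask the
  // master (here, an RPC proxy to it) to schedule the update job.
  Configuration conf = new Configuration();
  conf.set("index.update.input.path", "hdfs://namenode/docs/updates/");     // input path
  conf.set("index.update.input.format", "com.example.BlogPostInputFormat"); // input formatter
  conf.set("index.update.analysis", "com.example.BlogPostAnalyzer");        // analysis
  boolean accepted = master.updateShards(conf);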

The Map phase of the Map/Reduce job formats and analyzes the input (in
parallel), while the Reduce phase collects and applies the updates to each
Lucene instance (again in parallel). The updates are applied using the
local file system where a Reduce task runs and then copied back to HDFS.
For example, if the updates caused a new Lucene segment to be created, the
new segment would be created on the local file system first, and then
copied back to HDFS.
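
A minimal sketch of the Reduce-side logic, using Lucene's IndexWriter and
Hadoop's FileSystem APIs; the Shard and DocUpdate types are assumptions
standing in for our internal classes:

  // Sketch: apply one shard's updates on the local file system, then
  // copy the resulting index files back to HDFS.
  void applyUpdates(Shard shard, Iterator<DocUpdate> updates) throws IOException {
    Path localDir = new Path(localWorkDir, shard.getName());
    FileSystem hdfs = FileSystem.get(conf);
    hdfs.copyToLocalFile(shard.getIndexPath(), localDir);  // fetch current version

    IndexWriter writer = new IndexWriter(localDir.toString(), analyzer, false);
    while (updates.hasNext()) {
      DocUpdate u = updates.next();
      if (u.isDelete()) writer.deleteDocuments(new Term("id", u.getId()));
      else writer.addDocument(u.toLuceneDocument());       // insert or update
    }
    writer.close();                                        // may create a new segment

    hdfs.copyFromLocalFile(localDir, shard.getNewVersionPath());
  }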

When the Map/Reduce job completes, a "new version" of the index is ready
to be queried. It is important to note that the new version of the index
is not derived from scratch. By leveraging Lucene's update algorithm, the
new version of each Lucene instance will share as many files as possible
with the previous version.

ENSURING INDEX CONSISTENCY
At any point in time, an index client always has a consistent view of the
shards in the index. The results of a search query include either all or
none of a recent update to the index. The details of the algorithm to
accomplish this are omitted here, but the basic flow is pretty simple.

After the Map/Reduce job to update the shards completes, the master will
tell each shard server to "prepare" the new version of the index. After
all the shard servers have responded affirmatively to the "prepare"
message, the new index is ready to be queried. An index client will then
lazily learn about the new index when it makes its next
getShardLocations() call to the master.

In essence, a lazy two-phase commit protocol is used, with "prepare" and
"commit" messages piggybacked on heartbeats. After a shard has switched to
the new index, the Lucene files in the old index that are no longer needed
can safely be deleted.
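
The shard-server side can be pictured as the loop below - a hedged sketch
in which the command accessors and the prepare/switch helpers are assumed
names; only sendHeartbeat() appears in the RPC list above.

  // Sketch: "prepare" and "commit" ride on the heartbeat responses.
  while (running) {
    ShardServerCommand cmd = master.sendHeartbeat();
    if (cmd.isPrepare()) {
      // Fetch and open the new version, but keep serving the old one.
      prepareNewVersion(cmd.getVersion());
    } else if (cmd.isCommit()) {
      switchSearchesTo(cmd.getVersion());     // new queries hit the new version
      deleteFilesUnusedBy(cmd.getVersion());  // old files not shared with it
    }
    try { Thread.sleep(HEARTBEAT_INTERVAL); } catch (InterruptedException e) { break; }
  }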

ACHIEVING FAULT-TOLERANCE
We rely on the fault-tolerance of Map/Reduce to guarantee that an index
update will eventually succeed. All shards are stored in HDFS and can be
read by any shard server in a cluster. For a given shard, if one of its
shard servers dies, new search requests are handled by its surviving
shard servers. To ensure that there is always enough coverage for a
shard, the master will instruct other shard servers to take over the
shards of a dead shard server.
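
On the master, restoring coverage might look roughly like this; the
timeout, the replica minimum and all helper names are assumptions:

  // Sketch: reassign the shards of servers whose heartbeats have stopped.
  for (ShardServer server : knownServers) {
    if (now() - server.lastHeartbeatTime() > SERVER_TIMEOUT) {
      for (Shard shard : server.servedShards()) {
        if (liveServerCount(shard) < MIN_SERVERS_PER_SHARD) {
          // Any live server can take over, since the shard's files are in HDFS.
          pickLeastLoadedLiveServer().assign(shard);  // sent in a heartbeat response
        }
      }
    }
  }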

PERFORMANCE ISSUES
Currently, each shard server reads a shard directly from HDFS. Experiments
have shown that this approach does not perform very well, with HDFS causing
Lucene to slow down fairly dramatically (by well over 5x when data blocks
are accessed over the network). Consequently, we are exploring different
ways to leverage the fault tolerance of HDFS and, at the same time, work
around its performance problems. One simple alternative is to add a local
file system cache on each shard server. Another alternative is to modify
HDFS so that an application has more control over where to store the
primary and replicas of an HDFS block. This feature may be useful for other
HDFS applications (e.g., HBase). We would like to collaborate with other
people who are interested in adding this feature to HDFS.
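
The local cache alternative is simple to sketch with Lucene's and Hadoop's
public APIs. The size-based staleness check below is only an assumption
about how such a cache might decide what to refresh:

  // Sketch: copy a shard's files out of HDFS once, then let Lucene read
  // them from local disk instead of over the network.
  Directory openCached(FileSystem hdfs, Path shardPath, File cacheDir) throws IOException {
    for (FileStatus f : hdfs.listStatus(shardPath)) {
      File local = new File(cacheDir, f.getPath().getName());
      if (!local.exists() || local.length() != f.getLen()) {
        hdfs.copyToLocalFile(f.getPath(), new Path(local.getPath()));
      }
    }
    return FSDirectory.getDirectory(cacheDir);  // local-disk speed for Lucene reads
  }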


Regards,
Ning Li

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Srikant Jakilinki <sr...@bluebottle.com>.
Hi Ning,

In continuation with our offline conversation, here is a public
expression of interest in your work and a description of ours. Apologies
in advance for the length; I hope the folks here will be able to
collaborate and/or share experiences and/or give us some pointers...

1) We are trying to leverage Lucene on Hadoop for blog archiving and
searching, i.e. ever-increasing data (in terabytes) on commodity hardware
in a generic LAN. These machines are neither high-spec nor dedicated;
they are actually used within the lab by users for day-to-day tasks.
Unfortunately, Nutch and Solr are not applicable to our situation - at
least not directly. Think of us as an academically oriented Technorati.

2) There are two aspects. One is that we want to archive the blogposts
that we hit under a UUID/timestamp taxonomy. This archive can be used for
many things like cached copies, diffing, surf acceleration etc. The other
aspect is to archive the indexes. You see, the indexes have a lifecycle.
For simplicity's sake, an index consists of one day's worth of blogposts
(roughly 15MM documents) and follows the <YYYYMMDD> taxonomy. Ideally, we
want to store an indefinite archive of blogposts and their indexes
side-by-side, but 1 year or 365 days is a start.

3) We want to use the taxonomical name of the post as a specific ID
field in the Lucene index, and we want to get away with not storing the
content of the post at all but only a file pointer/reference to it. This,
we hope, will keep the index sizes low, but the fact remains that this is
a case of multiple threads on multiple JVMs handling multiple indexes on
multiple machines. Further, the posts and indexes are mostly WORM, but
there may be situations where they have to be updated - for example, if
some blog posts have edited content, have to be removed for copyright, or
are updated with metadata like rank. There is some duplicate-detection
work that has to be done here, but it is out of scope for now. And oh,
the lab is a Linux-Windows environment.

4) Our first port of call is to have Hadoop running on this group of
machines (without clustering or load balancing or grid or master/slave
mumbo jumbo) in the simplest way possible. The goal is to make
applications see the bunch of machines as a reliable, scalable,
fault-tolerant, average-performing file store with simple file CRUD
operations. For example, the blog crawler should be able to put the
blogposts into this HDFS in live or batch mode. With about 20 machines,
each fitted with a 240GB drive for the experiment, we have about 4.5 TB
of storage available.

5) Next we want to handle Lucene and exploit the structure of its index
and the algorithms behind it. Since a Lucene index is a directory of
files, we intend to 'tag' the files as belonging to one index and store
them on the HDFS. At any instant in time, an index can be regenerated and
used. The regenerated index is however not used directly from HDFS but
copied into the local filesystem of the indexer/searcher. This copy is
subject to change, and every once in a while the constituent files in the
HDFS are overwritten with the latest files. Hence, naming is quite
important to us. Even so, we feel that the number of files that have to
be updated is quite small, and that we can use MD5 sums to make sure we
only update the files whose content changed (see the sketch below).
However, this means that out of the 4.5 TB available, we use half for
archival and the other half for searching. Even so, we should be able to
store a year's worth of posts and indexes. Disks are no problem.
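
(For point 5, the MD5 comparison is straightforward with the standard
java.security.MessageDigest API; a rough sketch, with the surrounding
sync logic left to us:)

  // Sketch: hash a stream so the local and HDFS copies can be compared.
  static String md5(InputStream in) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] buf = new byte[8192];
    for (int n; (n = in.read(buf)) > 0; ) {
      md.update(buf, 0, n);
    }
    in.close();
    return new BigInteger(1, md.digest()).toString(16);
  }

  // Overwrite the HDFS copy only when the local content has changed.
  if (!md5(new FileInputStream(local)).equals(md5(hdfs.open(remotePath)))) {
    hdfs.delete(remotePath);  // drop the stale copy
    hdfs.copyFromLocalFile(new Path(local.getPath()), remotePath);
  }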

6) Right then. So, we have (365*15MM) posts and (365*LFi) Lucene file
segments on the HDFS. Suppose there are N machines online; then each
machine will have to own 365/N indexes. N constantly keeps changing, but
at any instant the 365 indexes should be live, and we are working on the
best way to achieve this kind of 'fair' autonomic computing cloud where,
when a machine goes down, the other machines add some indexes to their
kitty, and when a machine is added, it relieves other machines of some
indexes. The searcher runs on each of these machines as a service
(IP:port), and queries are served using a ParallelMultiSearch() [across
the machines] and a MultiSearch() [within each machine] (sketched below)
so that we need not have an unmanageable number of JVMs per machine - at
most one for Hadoop, one for the cloud and one for search. We are
wondering if Solr can be used for search if it supports multiple indexes
available on the same machine.
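
(For point 6, the arrangement maps onto Lucene's stock classes; a rough
sketch with made-up host names, assuming each machine exports a
RemoteSearchable over RMI:)

  // Sketch: one remote searchable per machine; each machine's exported
  // object can itself be a MultiSearcher over its local indexes.
  Searchable[] machines = {
    (Searchable) Naming.lookup("//node1:1099/searcher"),
    (Searchable) Naming.lookup("//node2:1099/searcher"),
  };
  Searcher searcher = new ParallelMultiSearcher(machines);  // across machines
  Hits hits = searcher.search(new TermQuery(new Term("content", "lucene")));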

As you can see, this is not a simple endeavour and it is obvious, I
suppose, that we are still at the theory stage and only now getting to
know the Lucene projects better. There is a huge body of work here,
albeit not acknowledged in the scientific community as it should be, and
I want to say kudos to all who have been responsible for it.
I wish and hope to utilize the collective consciousness to mount our
challenge. Any pointers, code, help, collaboration et al. for any of the
six points above - it goes without saying - is welcome, and we look
forward to sharing our experiences in a formal written discourse as and
when we have them.

Cheers,
Srikant

Ning Li wrote:
> There have been several proposals for a Lucene-based distributed index
> architecture.
>  1) Doug Cutting's "Index Server Project Proposal" at
>     http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
>  2) Solr's "Distributed Search" at
>     http://wiki.apache.org/solr/DistributedSearch
>  3) Mark Butler's "Distributed Lucene" at
>     http://wiki.apache.org/hadoop/DistributedLucene
>

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Ian Holsman <li...@holsman.net>.
Ning Li wrote:
> One main focus is to provide fault-tolerance in this distributed index
> system. Correct me if I'm wrong, but I think SOLR-303 is focused on
> merging results from multiple shards right now. We'd like to start an
> open source project for a fault-tolerant distributed index system (or
> join one if it already exists) if there is enough interest. Making Solr
> work on top of such a system could be an important goal, and SOLR-303
> would be a big part of it in that case.
>   

I guess it depends on how you set up your shards in 303.
We plan on having a master/slave relationship on each shard, so that 
each shard would sync the same way solr does currently.

regards
Ian

> I should have made it clear that disjoint data sets are not a requirement of
> the system.
>
>
> On Feb 6, 2008 12:57 PM, Ian Holsman <li...@holsman.net> wrote:
>
>   
>> Hi.
>> AOL has a couple of projects going on in the lucene/hadoop/solr space,
>> and we will be pushing more stuff out as we can. We don't have anything
>> going with solr over hadoop at the moment.
>>
>> I'm not sure if this would be better than what SOLR-303 does, but you
>> should have a look at the work being done there.
>>
>> One of the things you mentioned is that the data sets are disjoint.
>> SOLR-303 doesn't require this, and allows us to have a document stored
>> in multiple shards (with different caching/update characteristics).
>>
>>
>>     
>
>   


Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Ning Li <ni...@gmail.com>.
One main focus is to provide fault-tolerance in this distributed index
system. Correct me if I'm wrong, but I think SOLR-303 is focused on
merging results from multiple shards right now. We'd like to start an
open source project for a fault-tolerant distributed index system (or
join one if it already exists) if there is enough interest. Making Solr
work on top of such a system could be an important goal, and SOLR-303
would be a big part of it in that case.

I should have made it clear that disjoint data sets are not a requirement of
the system.


On Feb 6, 2008 12:57 PM, Ian Holsman <li...@holsman.net> wrote:

> Hi.
> AOL has a couple of projects going on in the lucene/hadoop/solr space,
> and we will be pushing more stuff out as we can. We don't have anything
> going with solr over hadoop at the moment.
>
> I'm not sure if this would be better than what SOLR-303 does, but you
> should have a look at the work being done there.
>
> One of the things you mentioned is that the data sets are disjoint.
> SOLR-303 doesn't require this, and allows us to have a document stored
> in multiple shards (with different caching/update characteristics).
>
>

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Ian Holsman <li...@holsman.net>.
Clay Webster wrote:
> There seem to be a few other players in this space too.
>
> Are you from Rackspace?  
> (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)
>
> AOL also has a Hadoop/Solr project going on.
>
> CNET does not have much brewing there.  Although Yonik and I had 
> talked about it a bunch -- but that was long ago. 
>   

Hi.
AOL has a couple of projects going on in the lucene/hadoop/solr space, 
and we will be pushing more stuff out as we can. We don't have anything 
going with solr over hadoop at the moment.

I'm not sure if this would be better than what SOLR-303 does, but you 
should have a look at the work being done there.

One of the things you mentioned is that the data sets are disjoint. 
SOLR-303 doesn't require this, and allows us to have a document stored 
in multiple shards (with different caching/update characteristics).
> --cw
>
> Clay Webster                                   tel:1.908.541.3724
> Associate VP, Platform Infrastructure         http://www.cnet.com
> CNET, Inc. (Nasdaq:CNET)                     mailto:clay@cnet.com
>
>   
>
>
>   


Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by "J. Delgado" <jd...@lendingclub.com>.
I'm pretty sure that what you describe is the case, especially
considering that PageRank (what drives their search results) is a
per-document value that is probably recomputed only after some long time
interval. I did see a MapReduce algorithm to compute PageRank as well.
However, I do think they must be distributing the query load across many,
many machines.

I also think that limiting results to a flat top 10 and then paging is
optimized for performance -- yet another reason why Google has not
implemented facet browsing or real-time clustering around their result
set.

J.D.

On Feb 6, 2008 4:22 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> (trimming excessive cc-s)
>
> Ning Li wrote:
> > No. I'm curious too. :)
> >
> > On Feb 6, 2008 11:44 AM, J. Delgado <jd...@lendingclub.com> wrote:
> >
> >> I assume that Google also has distributed index over their
> >> GFS/MapReduce implementation. Any idea how they achieve this?
>
> I'm pretty sure that MapReduce/GFS/BigTable is used only for creating
> the index (as well as crawling, data mining, web graph analysis, static
> scoring etc). The overhead of MR jobs is just too high.
>
> Their impressive search response times are most likely the result of
> extensive caching of pre-computed partial hit lists for frequent terms
> and phrases - at least that's what I suspect after reading this paper
> (not by Google folks, but very enlightening):
> http://citeseer.ist.psu.edu/724464.html

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
(trimming excessive cc-s)

Ning Li wrote:
> No. I'm curious too. :)
> 
> On Feb 6, 2008 11:44 AM, J. Delgado <jd...@lendingclub.com> wrote:
> 
>> I assume that Google also has distributed index over their
>> GFS/MapReduce implementation. Any idea how they achieve this?

I'm pretty sure that MapReduce/GFS/BigTable is used only for creating 
the index (as well as crawling, data mining, web graph analysis, static 
scoring etc). The overhead of MR jobs is just too high.

Their impressive search response times are most likely the result of 
extensive caching of pre-computed partial hit lists for frequent terms 
and phrases - at least that's what I suspect after reading this paper 
(not by Google folks, but very enlightening): 
http://citeseer.ist.psu.edu/724464.html

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Ning Li <ni...@gmail.com>.
No. I'm curious too. :)

On Feb 6, 2008 11:44 AM, J. Delgado <jd...@lendingclub.com> wrote:

> I assume that Google also has distributed index over their
> GFS/MapReduce implementation. Any idea how they achieve this?
>
> J.D.
>

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by "J. Delgado" <jd...@lendingclub.com>.
I assume that Google also has distributed index over their
GFS/MapReduce implementation. Any idea how they achieve this?

J.D.



On Feb 6, 2008 11:33 AM, Clay Webster <cl...@cnet.com> wrote:
>
> There seem to be a few other players in this space too.
>
> Are you from Rackspace?
> (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)
>
> AOL also has a Hadoop/Solr project going on.
>
> CNET does not have much brewing there.  Although Yonik and I had
> talked about it a bunch -- but that was long ago.
>
> --cw
>
> Clay Webster                                   tel:1.908.541.3724
> Associate VP, Platform Infrastructure         http://www.cnet.com
> CNET, Inc. (Nasdaq:CNET)                     mailto:clay@cnet.com
>
>

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Ning Li <ni...@gmail.com>.
I work for IBM Research. I read the Rackspace article. Rackspace's Mailtrust
has a similar design. Happy to see an existing application on such a system.
Do they plan to open-source it? Is the AOL project an open source project?

On Feb 6, 2008 11:33 AM, Clay Webster <cl...@cnet.com> wrote:

>
> There seem to be a few other players in this space too.
>
> Are you from Rackspace?
> (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)
>
> AOL also has a Hadoop/Solr project going on.
>
> CNET does not have much brewing there.  Although Yonik and I had
> talked about it a bunch -- but that was long ago.
>
> --cw
>
> Clay Webster                                   tel:1.908.541.3724
> Associate VP, Platform Infrastructure         http://www.cnet.com
> CNET, Inc. (Nasdaq:CNET)                     mailto:clay@cnet.com
>
>

RE: Lucene-based Distributed Index Leveraging Hadoop

Posted by Clay Webster <cl...@cnet.com>.
There seem to be a few other players in this space too.

Are you from Rackspace?  
(http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)

AOL also has a Hadoop/Solr project going on.

CNET does not have much brewing there.  Although Yonik and I had 
talked about it a bunch -- but that was long ago. 

--cw

Clay Webster                                   tel:1.908.541.3724
Associate VP, Platform Infrastructure         http://www.cnet.com
CNET, Inc. (Nasdaq:CNET)                     mailto:clay@cnet.com

> -----Original Message-----
> From: Ning Li [mailto:ning.li.li@gmail.com]
> Sent: Wednesday, February 06, 2008 1:57 PM
> To: general@lucene.apache.org; java-dev@lucene.apache.org; solr-
> user@lucene.apache.org
> Subject: Lucene-based Distributed Index Leveraging Hadoop
> 
> There have been several proposals for a Lucene-based distributed index
> architecture.
>  1) Doug Cutting's "Index Server Project Proposal" at
>     http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
>  2) Solr's "Distributed Search" at
>     http://wiki.apache.org/solr/DistributedSearch
>  3) Mark Butler's "Distributed Lucene" at
>     http://wiki.apache.org/hadoop/DistributedLucene

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Srikant Jakilinki <sr...@bluebottle.com>.
Hi Ning,

In continuation of our offline conversation, here is a public 
expression of interest in your work and a description of ours. Apologies 
in advance for the length, and I hope that folk will be able to 
collaborate and/or share experiences and/or give us some pointers...

1) We are trying to leverage Lucene on Hadoop for blog archiving and 
searching, i.e. ever-increasing data (in terabytes) on commodity hardware 
in a generic LAN. These machines are neither high-spec nor dedicated but 
are actually used within the lab by users for day-to-day tasks. 
Unfortunately, Nutch and Solr are not applicable to our situation - at 
least not directly. Think of us as an academically oriented Technorati.

2) There are 2 aspects. One is that we want to archive the blogposts that 
we hit under a UUID/timestamp taxonomy. This archive can be used for 
many things like cached copies, diffing, surf acceleration etc. The 
other aspect is to archive the indexes. You see, the indexes have a 
lifecycle. For simplicity's sake, an index consists of one day's worth of 
blogposts (roughly 15MM documents) and follows the <YYYYMMDD> taxonomy. 
Ideally, we want to store an indefinite archive of blogposts and their 
indexes side-by-side, but 1 year or 365 days is a start.

3) We want to use the taxonomical name of the post as a specific ID 
field in the Lucene index and want to get away with not storing the 
content of the post at all but only a file pointer/reference to it. This, 
we hope, will keep the index sizes low, but the fact remains that this is 
a case of multiple threads on multiple JVMs handling multiple indexes on 
multiple machines. Further, the posts and indexes are mostly WORM, but 
there may be situations where they have to be updated - for example, if 
some blog posts have edited content, have to be removed for copyright, 
or are updated with metadata like rank. There is some duplication detection 
work that has to be done here, but it is out of scope for now. And oh, 
the lab is a Linux-Windows environment.
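For point 3, the per-post Lucene document might be built along these
lines (a sketch of the idea only; the field names and values are made
up, the Field flags are the Lucene 2.x ones):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  // Hypothetical blog-post document: the content is indexed for search
  // but not stored; only the taxonomical ID and a pointer are stored.
  Document makePostDoc(String id, String ref, String postText) {
    Document doc = new Document();
    doc.add(new Field("id", id,                 // taxonomical name
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("ref", ref,               // file pointer only
                      Field.Store.YES, Field.Index.NO));
    doc.add(new Field("content", postText,     // searchable, not stored
                      Field.Store.NO, Field.Index.TOKENIZED));
    return doc;
  }

At query time the application resolves the stored "ref" against the
archive to fetch the post itself, keeping the index small at the cost
of one extra lookup per hit.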

4) Our first port of call is to have Hadoop running on this group of 
machines (without clustering or load balancing or grid or master/slave 
mumbo jumbo) in the simplest way possible, the goal being to make 
applications see the bunch of machines as a reliable, scalable, 
fault-tolerant, average-performing file store with simple file CRUD 
operations. For example, the blog crawler should be able to put the 
blogposts into this HDFS in live or in batch mode. With about 20 machines, 
each fitted with a 240GB drive for the experiment, we have about 4.5 TB 
of storage available.

5) Next we want to handle Lucene and exploit the structure of its index 
and the algorithms behind it. Since a Lucene index is a directory of 
files, we intend to 'tag' the files as belonging to one index and store 
them on the HDFS. At any instant in time, an index can be regenerated 
and used. The regenerated index is however not used directly from HDFS 
but copied into the local filesystem of the indexer/searcher. This copy 
is subject to change, and every once in a while the constituent files in 
the HDFS are overwritten with the latest files. Hence, naming is quite 
important to us. Even so, we feel that the number of files that have to 
be updated is quite small, and that we can use MD5 sums to make sure we 
only update the files whose content has changed. However, this means that 
out of the 4.5 TB available, we use half for archival and the other half 
for searching. Even so, we should be able to store a year's worth of posts 
and indexes. Disks are no problem.
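For point 5, the MD5 check could look roughly like this sketch (my
illustration; what to do with the digests, and how the HDFS side's
digest is obtained, is left out):

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.security.MessageDigest;

  // Hypothetical helper: hex MD5 of an index file, compared against
  // the digest of its HDFS counterpart to skip unchanged files.
  static String md5(File f) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    InputStream in = new FileInputStream(f);
    byte[] buf = new byte[8192];
    for (int n; (n = in.read(buf)) > 0; ) {
      md.update(buf, 0, n);
    }
    in.close();
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }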

6) Right then. So, we have (365*15MM) posts and (365*LFi) Lucene file 
segments on the HDFS. Suppose there are N machines online; then each 
machine will have to own 365/N indexes. N constantly keeps changing, but 
at any instant the 365 indexes should be live, and we are working on the 
best way to achieve this kind of 'fair' autonomic computing cloud where, 
when a machine goes down, the other machines add some indexes to their 
kitty, and if a machine is added, it relieves other machines of some 
indexes. The searcher runs on each of these machines and is a service 
(IP:port), and queries are served using a ParallelMultiSearcher [across 
the machines] and a MultiSearcher [within each machine] so that we need 
not have an unmanageable number of JVMs per machine - at most 1 for 
Hadoop, 1 for the cloud and 1 for search. We are wondering if Solr can be 
used for search if it supports multiple indexes available on the same 
machine.
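For point 6, Lucene's bundled remote search support would allow
something like the sketch below (illustrative only; the RMI names and
host list are assumptions). Each machine would export a MultiSearcher
over its local day-indexes via RemoteSearchable, and the client fans
out with a ParallelMultiSearcher:

  import java.rmi.Naming;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.ParallelMultiSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Searchable;

  // Hypothetical federated query across N machines, each of which has
  // registered a RemoteSearchable under rmi://<host>/searcher.
  Hits searchAll(String[] hosts, Query query) throws Exception {
    Searchable[] remotes = new Searchable[hosts.length];
    for (int i = 0; i < hosts.length; i++) {
      remotes[i] = (Searchable) Naming.lookup("//" + hosts[i] + "/searcher");
    }
    ParallelMultiSearcher searcher = new ParallelMultiSearcher(remotes);
    return searcher.search(query);  // results merged across machines
  }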

As you can see, this is not a simple endeavour and it is obvious, I 
suppose, that we are still at the theory stage and only now getting to 
know the Lucene projects better. There is a huge body of work here, albeit 
not acknowledged in the scientific community as it should be, and I want 
to say kudos to all who have been responsible for it. 
I wish and hope to utilize the collective consciousness to mount our 
challenge. Any pointers, code, help, collaboration et al. for any of the 
6 points above - it goes without saying - are welcome, and we look 
forward to sharing our experiences in a formal written discourse as and 
when we have them.

Cheers,
Srikant

Ning Li wrote:
> There have been several proposals for a Lucene-based distributed index
> architecture.
>  1) Doug Cutting's "Index Server Project Proposal" at
>     http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
>  2) Solr's "Distributed Search" at
>     http://wiki.apache.org/solr/DistributedSearch
>  3) Mark Butler's "Distributed Lucene" at
>     http://wiki.apache.org/hadoop/DistributedLucene
>


Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Doug Cutting <cu...@apache.org>.
[ No longer cross-posting to java-dev and solr-user. ]

Andrzej Bialecki wrote:
>> A particular client should be able to provide a consistent read/write 
>> view by bonding to particular replicas of a shard.  Thus a user who 
>> makes a modification should be able to generally see that modification 
>> in results immediately, while other users, talking to different 
>> replicas, may not see it until synchronization is complete.
> 
> This requires that we use versioning, and that we have a "shard manager" 
> that knows the latest versions of each shard among the whole active set 
> - or that clients discover this dynamically by querying the shard 
> servers every now and then.

Yes, there needs to be a master that knows the shard hash function. 
However, I'm not sure what you mean by "versioning".  In general, there 
is no "latest" version of a shard.  Different replicas have had different 
updates, and must, between themselves, resolve conflicts.  A client 
would generally talk to just one replica of each shard.  This is like 
CouchDB.  If different fields of a document are modified on different 
replicas, then the changes can be merged.  Edits to a text field might 
sometimes even be mergeable.  But, in general, if two replicas both contain 
unmergeable changes to the same field, one will win and one will lose. 
Similarly, if a document id is deleted in one replica and added in another 
at approximately the same time, then the addition would generally win. 
Thus if a single client switches which shard replica it talks to, then 
it could possibly lose deletions.  Or if different clients attempt to 
modify the same document, one client's changes may be overwritten by the 
other.  This is similar to the way that Amazon's Dynamo works: in the 
case of failures, shopping cart deletions can be lost, and deleted 
things may thus re-appear in one's shopping cart.  This happens rarely, 
and confirmation is required before final sale, so it is not a big 
problem.  Perhaps conflicts can be flagged and manually resolved by the 
application.  Or perhaps clocks can be sufficiently synchronized that 
the vast majority of conflicts can be automatically resolved correctly.
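For what it's worth, the automatic resolution Doug sketches is commonly
implemented as last-writer-wins over a per-value timestamp; a toy
version (my illustration, not part of any proposal here):

  // Hypothetical versioned value carried by each replica.
  class Versioned {
    long timestamp;     // wall-clock time of the edit
    String replicaId;   // tie-breaker so all replicas converge
    String value;
  }

  // Last-writer-wins: the newer edit survives; equal timestamps fall
  // back to replica id so every replica picks the same winner.
  static Versioned resolve(Versioned a, Versioned b) {
    if (a.timestamp != b.timestamp) {
      return a.timestamp > b.timestamp ? a : b;
    }
    return a.replicaId.compareTo(b.replicaId) >= 0 ? a : b;
  }

As Doug notes, this trades occasional lost updates (the Dynamo shopping
cart case) for fully automatic convergence.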

Doug

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Tim Jones <tj...@apache.org>.
I am guessing that the idea behind not putting the indexes in HDFS is  
(1) to maximize performance; and (2) that they are relatively transient -  
the data they are created from could be in HDFS, but the indexes  
themselves are just local.  To avoid having to recreate them, a backup  
copy could be kept in HDFS.

Since a goal is to be able to update them (frequently), this seems  
like a good approach to me.

Tim


Andrzej Bialecki wrote:
> Doug Cutting wrote:
>> My primary difference with your proposal is that I would like to  
>> support online indexing.  Documents could be inserted and removed  
>> directly, and shards would synchronize changes amongst replicas,  
>> with an "eventual consistency" model.  Indexes would not be stored  
>> in HDFS, but directly on the local disk of each node.  Hadoop would  
>> perhaps not play a role. In many ways this would resemble CouchDB,  
>> but with explicit support for sharding and failover from the outset.
>
> It's true that searching over HDFS is slow - but I'd hate to lose  
> all other HDFS benefits and have to start from scratch ... I wonder  
> what would be the performance of FsDirectory over an HDFS index that  
> is "pinned" to a local disk, i.e. a full local replica is available,  
> with block size of each index file equal to the file size.



Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Ning,
> 
> I am also interested in starting a new project in this area.  The 
> approach I have in mind is slightly different, but hopefully we can come 
> to some agreement and collaborate.

I'm interested in this too.

> My current thinking is that the Solr search API is the appropriate 
> model.  Solr's facets are an important feature that require low-level 
> support to be practical.  Thus a useful distributed search system should 
> support facets from the outset, rather than attempt to graft them on 
> later.  In particular, I believe this requirement mandates disjoint shards.

I agree - shards should be disjoint also because if we eventually want 
to manage multiple replicas of each shard across the cluster (for 
reliability and performance) then overlapping documents would complicate 
both the query dispatching process and the merging of partial result sets.


> My primary difference with your proposal is that I would like to support 
> online indexing.  Documents could be inserted and removed directly, and 
> shards would synchronize changes amongst replicas, with an "eventual 
> consistency" model.  Indexes would not be stored in HDFS, but directly 
> on the local disk of each node.  Hadoop would perhaps not play a role. 
> In many ways this would resemble CouchDB, but with explicit support for 
> sharding and failover from the outset.

It's true that searching over HDFS is slow - but I'd hate to lose all 
other HDFS benefits and have to start from scratch ... I wonder what 
would be the performance of FsDirectory over an HDFS index that is 
"pinned" to a local disk, i.e. a full local replica is available, with 
block size of each index file equal to the file size.
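The "pinned" layout Andrzej wonders about can at least be expressed
with the public FileSystem.create() overload that takes an explicit
block size; a sketch (the surrounding method and buffer size are mine):

  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Hypothetical upload that gives each Lucene index file a block size
  // equal to its own length, so the whole file sits in one HDFS block
  // and is replicated (and ideally placed) as a unit.
  void writePinned(FileSystem hdfs, Path dst, byte[] bytes,
                   short replication) throws Exception {
    // Block size must be a positive multiple of the 512-byte checksum
    // chunk, so round the file length up.
    long blockSize = Math.max(512, ((long) bytes.length + 511) / 512 * 512);
    FSDataOutputStream out =
        hdfs.create(dst, true, 64 * 1024, replication, blockSize);
    out.write(bytes);
    out.close();
  }

Whether a replica of every such block can then be kept on the searching
node is exactly the placement-control question raised in the original
post's PERFORMANCE ISSUES section.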


> A particular client should be able to provide a consistent read/write 
> view by bonding to particular replicas of a shard.  Thus a user who 
> makes a modification should be able to generally see that modification 
> in results immediately, while other users, talking to different 
> replicas, may not see it until synchronization is complete.

This requires that we use versioning, and that we have a "shard manager" 
that knows the latest versions of each shard among the whole active set 
- or that clients discover this dynamically by querying the shard 
servers every now and then.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Ning Li <ni...@gmail.com>.
Doug,

I'm looking forward to the collaboration!

> My current thinking is that the Solr search API is the appropriate
> model.  Solr's facets are an important feature that require low-level

I'm thinking: can we make the type of shard updater/searcher and
result merger configurable in a general distributed index system?
Vanilla Lucene is one type. Solr is another. Nutch could have one.
Applications can write their own customized type (it must be Lucene-based).
In the case of a Solr-typed system, for example, an application sends
a search request to an index client. The index client sends the search
request to shard servers which host Solr searchers. The index client
uses the Solr result merger to merge the results from all the shards
and returns the merged result to the application.
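The pluggable "type" idea might boil down to a pair of interfaces like
these (a sketch of the idea, not an existing API; SearchResults and
Query are the types from the proposal's RPC section):

  // Hypothetical plug-in points, one implementation per index type
  // (vanilla Lucene, Solr, Nutch, or an application-defined type).
  interface ShardSearcher {
    // Runs on a shard server against its local Lucene-based instance.
    SearchResults search(Query query) throws Exception;
  }

  interface ResultMerger {
    // Runs in the index client over the per-shard results.
    SearchResults merge(SearchResults[] perShard);
  }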

> My primary difference with your proposal is that I would like to support
> online indexing.  Documents could be inserted and removed directly, and
> shards would synchronize changes amongst replicas, with an "eventual
> consistency" model.

I've been thinking about batch update vs. online update. :)
Is it possible to support both efficiently in one system?

We may say that a system which supports online update can
handle batch update. However, it depends on whether the updates
on a shard server are lost when the server goes down. In a
system targeting batch update, the entirety of a batch update
can simply be guaranteed by a map/reduce job.

Your thoughts?

The online update you described here is different from the one
you described in the Index Server Project proposal a while ago.
It was multi-reader single-writer before; now it's multi-reader
multi-writer with eventual consistency. Is it because the latter
supports a more general usage scenario?

Regards,
Ning


Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Doug Cutting <cu...@apache.org>.
Ning,

I am also interested in starting a new project in this area.  The 
approach I have in mind is slightly different, but hopefully we can come 
to some agreement and collaborate.

My current thinking is that the Solr search API is the appropriate 
model.  Solr's facets are an important feature that require low-level 
support to be practical.  Thus a useful distributed search system should 
support facets from the outset, rather than attempt to graft them on 
later.  In particular, I believe this requirement mandates disjoint shards.

My primary difference with your proposal is that I would like to support 
online indexing.  Documents could be inserted and removed directly, and 
shards would synchronize changes amongst replicas, with an "eventual 
consistency" model.  Indexes would not be stored in HDFS, but directly 
on the local disk of each node.  Hadoop would perhaps not play a role. 
In many ways this would resemble CouchDB, but with explicit support for 
sharding and failover from the outset.

A particular client should be able to provide a consistent read/write 
view by bonding to particular replicas of a shard.  Thus a user who 
makes a modification should be able to generally see that modification 
in results immediately, while other users, talking to different 
replicas, may not see it until synchronization is complete.

There are many unresolved issues in my mind around sharding and 
replication that I hope to reach some clarity on before beginning 
implementation.  Does this sound like it could be of interest to you?

Cheers,

Doug

Ning Li wrote:
> There have been several proposals for a Lucene-based distributed index
> architecture.
>  1) Doug Cutting's "Index Server Project Proposal" at
>     http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
>  2) Solr's "Distributed Search" at
>     http://wiki.apache.org/solr/DistributedSearch
>  3) Mark Butler's "Distributed Lucene" at
>     http://wiki.apache.org/hadoop/DistributedLucene
> 
> [...]


RE: Lucene-based Distributed Index Leveraging Hadoop

Posted by marcus clemens <ma...@hotmail.com>.
Message to Mark Butler: I am looking for candidates that have Lucene
experience for contract and permanent positions.

Can you please send me your CV?

> Date: Thu, 21 Aug 2008 16:36:01 +0100
> From: richard.marr@gmail.com
> To: general@lucene.apache.org
> Subject: Re: Lucene-based Distributed Index Leveraging Hadoop
>
> [...]

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Richard Marr <ri...@gmail.com>.
2008/8/21 Stefan Groschupf <sg...@101tec.com>:
>
> Is there any material published about this? I would be very interested to
> see Marks slides and hear about the discussion.
>

In case anybody wants to see Mark's talk, the slides and video are here:
http://skillsmatter.com/podcast/home/distributed-lucene-for-hadoop

Rich

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Richard Marr <ri...@gmail.com>.
Stefan,

I've got a lot of reading and learning to do  :o)

Thanks for the info, and good luck with your deployment.

Rich

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Stefan Groschupf <sg...@101tec.com>.
Hi,

> In terms of which project best fits my needs my gut feeling is that
> dlucene is pretty close. It supports incremental updates, and doesn't
> build in dependencies on systems like HDFS or Terracotta (I don't yet
> understand all the implications of those systems so would rather keep
> things simple if possible).

Upgrades...
The way we solve this with katta is that we simply deploy a new small  
index and use * in the client instead of a fixed index name.
Then once a night we merge all the small indexes (since having many  
slows things down) into one big new index.
To solve the problem of duplicate documents, each document gets a  
timestamp, and in the client we do a simple dedup based on a key and  
always use the document with the latest timestamp.
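That client-side dedup might look like this sketch (illustrative only;
Hit stands in for whatever result type carries the key and timestamp):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Hypothetical dedup: when a key appears in both a small daily index
  // and the big merged index, keep only the newest copy.
  static List<Hit> dedup(List<Hit> hits) {
    Map<String, Hit> newest = new HashMap<String, Hit>();
    for (Hit h : hits) {
      Hit seen = newest.get(h.getKey());
      if (seen == null || h.getTimestamp() > seen.getTimestamp()) {
        newest.put(h.getKey(), h);
      }
    }
    return new ArrayList<Hit>(newest.values());
  }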

Dependencies...
Katta is independent of those technologies; it is Lucene, ZooKeeper  
and Hadoop RPC (instead of RMI, HTTP or Apache MINA). We support  
loading index shards from a Hadoop file system, but you can also load  
them from a mounted remote HDD/NAS or whatever you like.

> The obvious drawback being that dlucene
> doesn't seem to be an active public project.
Mark needs to answer this, but dlucene is checked in to the katta svn  
and I saw Mark checking in changes to dlucene. There was a discussion  
between Mark and me about bringing dlucene and katta together, and I  
really would love to see that happen, but unfortunately we had a lot of  
pressure from our customer to deliver something, so we had to focus on  
other things. More developers getting involved would clearly help  
here... :-)

>
>
> Thanks for the reply Stefan. I'll certainly be taking a look through
> the code for Katta since no doubt there's a lot to learn in there.
Katta will be deployed into a production system of our customer in  
less than 4 weeks - so we are working hard to iron out issues.
However, katta has been running for 6 weeks in a 10-node test  
environment under heavy load.

Stefan 

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Richard Marr <ri...@gmail.com>.
Stefan,

> Is there any material published about this? I would be very interested to
> see Marks slides and hear about the discussion.

I believe all the slides will be available. I'll post a link as soon
as I have one.

> Please keep in mind that katta is very young and Compass or Solr might be
> more interesting if you need something working now, though they might have
> different goals and focus from dlucene or katta.

I am looking to have something working relatively quickly, but my
performance needs and use cases are relatively modest (for now) so
some degree of string and sticky tape in the implementation is okay in
the short term. My main aim is to ensure that whatever I implement
scales horizontally without too much drama.

In terms of which project best fits my needs my gut feeling is that
dlucene is pretty close. It supports incremental updates, and doesn't
build in dependencies on systems like HDFS or Terracotta (I don't yet
understand all the implications of those systems so would rather keep
things simple if possible). The obvious drawback being that dlucene
doesn't seem to be an active public project.

Thanks for the reply Stefan. I'll certainly be taking a look through
the code for Katta since no doubt there's a lot to learn in there.

All the best...

Rich

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Stefan Groschupf <sg...@101tec.com>.
Hi All, Hi Mark,

> It was interesting to hear Mark Butler present his implementation of
> Distributed Lucene at the Hadoop User Group meeting in London on
> Tuesday. There's obviously been quite a bit of discussion on the
> subject, and lots of interested parties. Mark, not sure if you're on
> this list but thanks for sharing.

Is there any material published about this? I would be very interested  
to see Mark's slides and hear about the discussion.


> Is this the forum to ask about open projects? I'm interested in
> joining a project as long as its goals aren't too distant from what I'm
> looking for. Based mostly on gut feeling I'd rather go for a
> stand-alone project that wasn't dependent on HDFS/Hadoop, but I'm willing
> to be convinced otherwise.

Rich, as you know there are a couple of projects in this area - Solr,  
Compass, dlucene and katta - and since all are open source I guess the  
easiest way to get involved is to join the mailing lists.

I can only speak for katta - we are very interested in getting more  
people involved to gain other perspectives. There is quite some activity  
in the project since it is part of an upcoming production system, but  
low traffic on the mailing list (so far all developers work in the  
same room).

You can find our mailing list on our source forge page:
http://katta.wiki.sourceforge.net/

Please keep in mind that katta is very young and Compass or Solr might  
be more interesting if you need something working now, though they  
might have different goals and focus from dlucene or katta.

Stefan Groschupf

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Richard Marr <ri...@gmail.com>.
> Stefan Groschupf (4 Apr) wrote:
>
> I just noticed - too late though - Ning already contributed the code to
> hadoop. So I guess my question should be rephrased: what is the idea of
> moving this into its own project?


Hi all,

It was interesting to hear Mark Butler present his implementation of
Distributed Lucene at the Hadoop User Group meeting in London on
Tuesday. There's obviously been quite a bit of discussion on the
subject, and lots of interested parties. Mark, not sure if you're on
this list but thanks for sharing.

Is this the forum to ask about open projects? I'm interested in
joining a project as long as its goals aren't too distant from what I'm
looking for. Based mostly on gut feeling I'd rather go for a
stand-alone project that wasn't dependent on HDFS/Hadoop, but I'm willing
to be convinced otherwise.

Rich

Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Stefan Groschupf <sg...@101tec.com>.
> Should we start from scratch or with a code contribution?
> Does someone still want to contribute their implementation?
I just noticed - too late though - Ning already contributed the code to  
hadoop. So I guess my question should be rephrased: what is the idea of  
moving this into its own project?


Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Stefan Groschupf <sg...@101tec.com>.
Hi All,

we are also very much interested in such a system and actually have to  
realize such a system for a project within the next 3 months.
I would prefer to work on an open source solution instead of doing  
another one behind closed doors, though we would need to start coding  
pretty soon. We have 3 full-time developers we could contribute to  
such a project during this time.

I'm happy to do all the organisational work, like setting up the  
complete infrastructure etc., to get it started.
I suggest we start with a SourceForge project since this is fast to  
set up and, if we qualify for Apache as a Lucene or Hadoop subproject,  
migrate there later - or is it easy to start an Apache incubator project?

We might just need a nice name for the project. Doug, any idea? :-)

Should we start from scratch or with a code contribution?
Does someone still want to contribute their implementation?


Thanks.
Stefan

On Feb 6, 2008, at 10:57 AM, Ning Li wrote:
> There have been several proposals for a Lucene-based distributed index
> architecture.
> 1) Doug Cutting's "Index Server Project Proposal" at
>    http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
> 2) Solr's "Distributed Search" at
>    http://wiki.apache.org/solr/DistributedSearch
> 3) Mark Butler's "Distributed Lucene" at
>    http://wiki.apache.org/hadoop/DistributedLucene
> [...]

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com



Re: Lucene-based Distributed Index Leveraging Hadoop

Posted by Doug Cutting <cu...@apache.org>.
Ning,

I am also interested in starting a new project in this area.  The 
approach I have in mind is slightly different, but hopefully we can come 
to some agreement and collaborate.

My current thinking is that the Solr search API is the appropriate 
model.  Solr's facets are an important feature that require low-level 
support to be practical.  Thus a useful distributed search system should 
support facets from the outset, rather than attempt to graft them on 
later.  In particular, I believe this requirement mandates disjoint shards.

My primary difference with your proposal is that I would like to support 
online indexing.  Documents could be inserted and removed directly, and 
shards would synchronize changes amongst replicas, with an "eventual 
consistency" model.  Indexes would not be stored in HDFS, but directly 
on the local disk of each node.  Hadoop would perhaps not play a role. 
In many ways this would resemble CouchDB, but with explicit support for 
sharding and failover from the outset.

A particular client should be able to provide a consistent read/write 
view by bonding to particular replicas of a shard.  Thus a user who 
makes a modification should be able to generally see that modification 
in results immediately, while other users, talking to different 
replicas, may not see it until synchronization is complete.

There are many unresolved issues in my mind around sharding and 
replication that I hope to reach some clarity on before beginning 
implementation.  Does this sound like it could be of interest to you?

Cheers,

Doug

Ning Li wrote:
> There have been several proposals for a Lucene-based distributed index
> architecture.
>  1) Doug Cutting's "Index Server Project Proposal" at
>     http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
>  2) Solr's "Distributed Search" at
>     http://wiki.apache.org/solr/DistributedSearch
>  3) Mark Butler's "Distributed Lucene" at
>     http://wiki.apache.org/hadoop/DistributedLucene
> 
> We have also been working on a Lucene-based distributed index architecture.
> Our design differs from the above proposals in the way it leverages Hadoop
> as much as possible. In particular, HDFS is used to reliably store Lucene
> instances, Map/Reduce is used to analyze documents and update Lucene
> instances
> in parallel, and Hadoop's IPC framework is used. Our design is geared for
> applications that require a highly scalable index and where batch updates
> to each Lucene instance are acceptable (verses finer-grained document at
> a time updates).
> 
> We have a working implementation of our design and are in the process
> of evaluating its performance. An overview of our design is provided below.
> We welcome feedback and would like to know if you are interested in working
> on it. If so, we would be happy to make the code publicly available. At the
> same time, we would like to collaborate with people working on existing
> proposals and see if we can consolidate our efforts.
> 
> TERMINOLOGY
> A distributed "index" is partitioned into "shards". Each shard corresponds
> to
> a Lucene instance and contains a disjoint subset of the documents in the
> index.
> Each shard is stored in HDFS and served by one or more "shard servers". Here
> we only talk about a single distributed index, but in practice multiple
> indexes
> can be supported.
> 
> A "master" keeps track of the shard servers and the shards being served by
> them. An "application" updates and queries the global index through an
> "index client". An index client communicates with the shard servers to
> execute a query.
> 
> KEY RPC METHODS
> This section lists the key RPC methods in our design. To simplify the
> discussion, some of their parameters have been omitted.
> 
>   On the Shard Servers
>     // Execute a query on this shard server's Lucene instance.
>     // This method is called by an index client.
>     SearchResults search(Query query);
> 
>   On the Master
>     // Tell the master to update the shards, i.e., Lucene instances.
>     // This method is called by an index client.
>     boolean updateShards(Configuration conf);
> 
>     // Ask the master where the shards are located.
>     // This method is called by an index client.
>     LocatedShards getShardLocations();
> 
>     // Send a heartbeat to the master. This method is called by a
>     // shard server. In the response, the master informs the
>     // shard server when to switch to a newer version of the index.
>     ShardServerCommand sendHeartbeat();
> 
> QUERYING THE INDEX
> To query the index, an application sends a search request to an index
> client. The index client then calls the shard server search() method
> for each shard of the index, merges the results and returns them to
> the application. The index client caches the mapping between shards
> and shard servers by periodically calling the master's
> getShardLocations() method.
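
That fan-out might look roughly like this on the index client 
(LocatedShards and ShardLocation are stand-ins for the real types, and 
merge() is assumed to merge-sort hits by score):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.lucene.search.Query;

  // Sketch of the index client's query path: fan the query out to one
  // server per shard, then merge the per-shard results.
  public SearchResults search(Query query) {
    LocatedShards shards = cachedLocations;    // kept fresh via getShardLocations()
    List<SearchResults> partial = new ArrayList<SearchResults>();
    for (ShardLocation loc : shards.all()) {
      partial.add(loc.server().search(query)); // RPC to a shard server
    }
    return SearchResults.merge(partial);       // combine hit lists by score
  }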
> 
> UPDATING THE INDEX USING MAP/REDUCE
> To update the index, an application sends an update request to an
> index client. The index client then calls the master's updateShards()
> method, which schedules a Map/Reduce job to update the index. The
> Map/Reduce job updates the shards in parallel and copies the new index
> files of each shard (i.e., Lucene instance) to HDFS.
> 
> The updateShards() method includes a "configuration", which provides
> information for updating the shards. More specifically, the configuration
> includes the following information:
>   - Input path. This provides the location of updated documents, e.g., HDFS
>     files or directories, or HBase tables.
>   - Input formatter. This specifies how to format the input documents.
>   - Analysis. This defines the analyzer to use on the input. The
>     analyzer determines whether a document is being inserted, updated,
>     or deleted. For inserts or updates, the analyzer also converts each
>     input document into a Lucene document.
> 
> The Map phase of the Map/Reduce job formats and analyzes the input (in
> parallel), while the Reduce phase collects and applies the updates to
> each Lucene instance (again in parallel). The updates are applied using
> the local file system where a Reduce task runs and then copied back to
> HDFS. For example, if the updates caused a new Lucene segment to be
> created, the new segment would be created on the local file system
> first, and then copied back to HDFS.
> 
> When the Map/Reduce job completes, a "new version" of the index is
> ready to be queried. It is important to note that the new version of
> the index is not derived from scratch. By leveraging Lucene's update
> algorithm, the new version of each Lucene instance will share as many
> files as possible with the previous version.
> 
> ENSURING INDEX CONSISTENCY
> At any point in time, an index client always has a consistent view of
> the shards in the index. The results of a search query include either
> all or none of a recent update to the index. The details of the
> algorithm to accomplish this are omitted here, but the basic flow is
> pretty simple.
> 
> After the Map/Reduce job to update the shards completes, the master
> will tell each shard server to "prepare" the new version of the index.
> After all the shard servers have responded affirmatively to the
> "prepare" message, the new index is ready to be queried. An index
> client will then lazily learn about the new index when it makes its
> next getShardLocations() call to the master.
> 
> In essence, a lazy two-phase commit protocol is used, with "prepare" and
> "commit" messages piggybacked on heartbeats. After a shard has switched to
> the new index, the Lucene files in the old index that are no longer needed
> can safely be deleted.
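
From a shard server's point of view, that piggybacked protocol might look 
like this sketch (the command names and helper methods are invented; 
ShardServerCommand is the response type from the RPC list above):

  // Shard-server main loop: every heartbeat response may carry a
  // command that advances the two-phase switch to a new index version.
  void heartbeatLoop() throws InterruptedException {
    while (running) {
      ShardServerCommand cmd = master.sendHeartbeat();
      switch (cmd.type()) {
        case PREPARE:                // open the new version, keep serving old
          openNewVersion(cmd.version());
          ackPrepared(cmd.version());
          break;
        case COMMIT:                 // atomically switch searches over
          switchTo(cmd.version());
          deleteObsoleteFiles();     // old files not shared by the new version
          break;
        default:                     // nothing to do this round
          break;
      }
      Thread.sleep(HEARTBEAT_INTERVAL);
    }
  }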
> 
> ACHIEVING FAULT-TOLERANCE
> We rely on the fault-tolerance of Map/Reduce to guarantee that an
> index update will eventually succeed. All shards are stored in HDFS
> and can be read by any shard server in a cluster. For a given shard,
> if one of its shard servers dies, new search requests are handled by
> its surviving shard servers. To ensure that there is always enough
> coverage for a shard, the master will instruct other shard servers to
> take over the shards of a dead shard server.
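
On the master, the failover logic could be as simple as the following 
sketch (all types and helpers invented):

  import java.util.Iterator;

  // A server that misses heartbeats for too long is declared dead and
  // its shards are reassigned. Any server can take over, since every
  // shard is readable from HDFS.
  void checkLiveness() {
    for (Iterator<ShardServer> it = servers.iterator(); it.hasNext(); ) {
      ShardServer s = it.next();
      if (now() - s.lastHeartbeatTime() > HEARTBEAT_TIMEOUT) {
        it.remove();                                // declare the server dead
        for (Shard shard : s.assignedShards()) {
          assignTo(pickLeastLoadedServer(), shard); // restore coverage
        }
      }
    }
  }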
> 
> PERFORMANCE ISSUES
> Currently, each shard server reads a shard directly from HDFS.
> Experiments have shown that this approach does not perform very well,
> with HDFS causing Lucene to slow down fairly dramatically (by well
> over 5x when data blocks are accessed over the network). Consequently,
> we are exploring different ways to leverage the fault tolerance of
> HDFS and, at the same time, work around its performance problems. One
> simple alternative is to add a local file system cache on each shard
> server. Another alternative is to modify HDFS so that an application
> has more control over where to store the primary and replicas of an
> HDFS block. This feature may be useful for other HDFS applications
> (e.g., HBase). We would like to collaborate with other people who are
> interested in adding this feature to HDFS.
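
The local-cache alternative could take a form like the sketch below, built 
on Lucene's Directory API; Lucene index files are write-once, so a cached 
copy never goes stale (the surrounding class and its two Directory fields 
are assumed, not taken from the evaluated implementation):

  import java.io.IOException;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.IndexInput;
  import org.apache.lucene.store.IndexOutput;

  private Directory local; // e.g., FSDirectory on the server's local disk
  private Directory hdfs;  // a Directory implementation backed by HDFS

  // Serve an index file from local disk, fetching it from HDFS the
  // first time it is requested.
  public IndexInput openCachedInput(String name) throws IOException {
    if (!local.fileExists(name)) {
      IndexInput in = hdfs.openInput(name);   // one-time download
      IndexOutput out = local.createOutput(name);
      byte[] buf = new byte[64 * 1024];
      long left = in.length();
      while (left > 0) {
        int chunk = (int) Math.min(buf.length, left);
        in.readBytes(buf, 0, chunk);
        out.writeBytes(buf, chunk);
        left -= chunk;
      }
      out.close();
      in.close();
    }
    return local.openInput(name);             // all reads hit local disk
  }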
> 
> 
> Regards,
> Ning Li
> 

