Posted to user@hbase.apache.org by Tim Robertson <ti...@gmail.com> on 2010/03/27 19:46:00 UTC

elastic search or other Lucene for HBase?

Hi all,

Is anyone using elastic search as an indexing layer to HBase content?
It looks to have a really nice API, and was thinking of setting up an
EC2 test where I maintain an ES index storing only the Key to HBase
rows.  So ES provides all search returning Key lists and all single
record Get being served from HBase.

Or is there a preferred distributed Lucene approach for HBase from the
few that have been popping up?  I have not had a chance to really dig
into the options but I know there has been a lot of chatter on this.

If no one has tried ES, I'll post some test results with MR based building.

Cheers,
Tim
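A minimal sketch of the pattern Tim describes: the search index stores only row keys (plus the indexed text), search returns key lists, and full records are fetched by key from the primary store. The class and field names here are hypothetical stand-ins for ES and HBase (plain in-memory dicts), not real client APIs:

```python
# Sketch of the "index stores only keys" pattern: a search layer returns
# row-key lists, and full records are served by key from the primary store.
# SearchIndex stands in for ES, RecordStore for HBase (hypothetical names).

class SearchIndex:
    def __init__(self):
        self.postings = {}  # term -> set of row keys

    def index(self, row_key, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(row_key)

    def search(self, term):
        # Returns only row keys, never full records.
        return sorted(self.postings.get(term.lower(), set()))


class RecordStore:
    def __init__(self):
        self.rows = {}  # row key -> full record (an HBase-style Get by key)

    def put(self, row_key, record):
        self.rows[row_key] = record

    def get(self, row_key):
        return self.rows[row_key]


def search_records(index, store, term):
    # Search hits the index for keys, then Gets each record from the store.
    return [store.get(k) for k in index.search(term)]


if __name__ == "__main__":
    index, store = SearchIndex(), RecordStore()
    store.put("row1", {"name": "Puma concolor", "country": "AR"})
    index.index("row1", "Puma concolor Argentina")
    print(index.search("puma"))                   # key list only
    print(search_records(index, store, "puma"))   # full records via Get
```

The point of the split is that the index stays small (keys only), while the record store scales independently and keeps serving Gets regardless of index load.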

Re: elastic search or other Lucene for HBase?

Posted by Daniel Einspanjer <de...@mozilla.com>.
  Mozilla is taking a hard look at using Elastic Search as an 
indexing/searching mechanism for Socorro 2.0.  We're evaluating the 
possibility of using the HBASE-2001 patch as a mechanism to hook in
NRT (near-real-time) indexing of the documents.

-Daniel

On 6/3/10 5:36 PM, Steven Noels wrote:
> On Thu, Jun 3, 2010 at 4:58 PM, Otis Gospodnetic
> <otis_gospodnetic@yahoo.com> wrote:
> Wow, Steven, you really did your homework well!  Major A+ in my book. :)
>
> I have a real talent for copy/paste, but credits for the write-up need to go
> to my colleague Bruno Dumon!
>
> Steven.

Re: elastic search or other Lucene for HBase?

Posted by Steven Noels <st...@outerthought.org>.
On Thu, Jun 3, 2010 at 4:58 PM, Otis Gospodnetic
<otis_gospodnetic@yahoo.com> wrote:

> Wow, Steven, you really did your homework well!  Major A+ in my book. :)


I have a real talent for copy/paste, but credits for the write-up need to go
to my colleague Bruno Dumon!

Steven.
-- 
Steven Noels                            http://outerthought.org/
Outerthought                            Open Source Java & XML
stevenn at outerthought.org             Makers of the Daisy CMS

Re: elastic search or other Lucene for HBase?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Wow, Steven, you really did your homework well!  Major A+ in my book. :)

 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Steven Noels <st...@outerthought.org>
> To: hbase-user <hb...@hadoop.apache.org>
> Sent: Thu, June 3, 2010 3:33:04 AM
> Subject: Re: elastic search or other Lucene for HBase?
>
> [Steven's message quoted in full; see his original post below.]

Re: elastic search or other Lucene for HBase?

Posted by Steven Noels <st...@outerthought.org>.
On Sat, Mar 27, 2010 at 8:46 PM, Tim Robertson <ti...@gmail.com> wrote:

> Hi all,
>
> Is anyone using elastic search as an indexing layer to HBase content?
> It looks to have a really nice API, and was thinking of setting up an
> EC2 test where I maintain an ES index storing only the Key to HBase
> rows.  So ES provides all search returning Key lists and all single
> record Get being served from HBase.
>
> Or is there a preferred distributed Lucene approach for HBase from the
> few that have been popping up?  I have not had a chance to really dig
> into the options but I know there has been a lot of chatter on this.
>


For Lily - www.lilycms.org, we opted for SOLR. Here's some rationale behind
that (copy-pasted from our draft Lily website):

Selecting a search solution: SOLR

For search, the choice for Lucene as core technology was pretty much a
given. In Daisy, our previous CMS, we used Lucene only for full-text search
and performed structural searches on the SQL database. We merged the results
from those two different search technologies on the fly, supporting mixed
structural and full-text queries. However, this merging, combined with other
high-level features of Daisy, was not designed to handle very large data
sets. For Lily, we decided that a better approach would be to perform all
searching using one technology, Lucene.

A downside to Lucene is that index updates are only visible with some delay
to searchers, though work is ongoing to improve this. At its heart it is a
text-search library, though with its fielded documents and the trie-range
queries, it handles more data-oriented queries quite well.

Lucene in itself is a library, not a standalone application, nor a scalable
search solution. But all this can be built on top. The best known standalone
search server on top of Lucene is SOLR, which we decided to use in Lily.

But before we made that choice, we considered a lot of the available
options:

   -

   Katta <http://katta.sourceforge.net/>. Katta provides a powerful scalable
   search model whereby each node is responsible for searching a number of
   shards, with replicas of each shard present on multiple nodes. This
   provides scaling for both index size and number of users, and gracefully
   handles node failures since the shards that were on a failed node will be
   available online on some other nodes. However, Katta is only a search
   solution, not an indexing solution, and does not offer extra search features
   such as faceting.
   -

   Hadoop contrib/index <http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/contrib/index/README>.
   This is a MapReduce solution for building Lucene indexes. The nice thing
   about it is that the MR framework manages spreading the index building work
   over multiple nodes, reschedules failed jobs, and so on. It can also be used
   to update existing indexes. The number of index shards is determined by the
   number of reduce tasks. Hadoop contrib/index is an ideal complement to
   Katta. The downside is that it is inherently batch-oriented, which excludes
   profiting from the ongoing Lucene near-real time (NRT) work.
   -

   The tools from LinkedIn <http://sna-projects.com/sna/>
   (blog <http://invertedindex.blogspot.com/>).
   LinkedIn has made available some cool Lucene-related projects like
   Bobo <http://sna-projects.com/bobo/>,
   an optimized facet browser that does not rely on cached bitsets,
   and Zoie <http://sna-projects.com/zoie/>,
   a real-time index-search solution (built in a different way than what is
   available in Lucene 3). They are apparently integrating it all in
   Sensei <http://sna-projects.com/sensei/>.
   It is interesting to study the design of these projects.
   -

   ElasticSearch <http://www.elasticsearch.com/>. ElasticSearch (ES) is a
   very new project that appeared as a one-man project at about the same time
   we made our choice for SOLR. One can easily launch a number of ES nodes,
   and they discover each other without configuration. Multiple indexes can be created
   using a simple REST API. When creating an index, you specify the number of
   shards and replicas you desire. It is designed to work on cloud computing
   solutions like EC2, where the local disk is only a temporary storage. There
   is a lot more to tell, but you can read that on their website. Despite the
   name 'elastic', it does not support indexes growing dynamically in the same
   way as tables can grow in HBase: the number of shards is fixed when creating
   the index. However, if you find yourself in need of more shards, you can
   create a new index with more shards and re-index your content into that. The
   number of shards is not related to the number of nodes, so you can plan for
   growth by choosing e.g. 10 shards even if you have just one or two nodes to
   start with.
   -

   Lucandra <http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/>,
   Lucehbase <http://github.com/thkoch2001/lucehbase> and
   Hbasene <http://github.com/akkumar/hbasene>.
   These projects work by storing the inverted index on top of Cassandra
   and HBase respectively. The use of a database is quite different from Lucene's
   segment-based approach. While it makes the storage of the inverted index
   scalable, it does not necessarily make all of Lucene's functionality
   scalable, such as sorting and faceting which depend on the field caches and
   bitset-based filters. Moreover, for HBase, which we know best, the storage
   is not as scalable as it may seem, since terms are stored as rows and the
   postings lists (= the documents containing the term) as columns. Usually the
   number of terms in a corpus is relatively limited, while the number of
   documents can be huge, but columns in HBase do not scale in the same way as
   rows. We think the scaling (sharding, replication) needs to happen on the
   level of Lucene instances itself, rather than just the storage below it.
   Still, it is interesting to watch how these projects will evolve.
   -

   Building our own. Another option was to just take Lucene itself and build
   our own scalable search solution using it. In this case we would have gone
   for a Katta/ElasticSearch-like approach to sharding and replication, with a
   focus on the search features we are most interested in (such as faceting).
   However, we decided that this would take too much of our time.
   - Our choice, SOLR <http://lucene.apache.org/solr/>. SOLR is a standalone
   Lucene-based search server made publicly available in 2006, making it the
   oldest of the solutions listed here. It makes a lot of Lucene functionality
   easily available, adds a schema with field types, faceting, different kinds
   of caches and cache warming, a user-friendly safe query syntax, and more.
   SOLR supports replicas and distributed searching over multiple SOLR
   instances, though you are responsible for setting it all up yourself; it is
   very much a static solution. Work on cloudifying
   SOLR <http://svn.apache.org/repos/asf/lucene/solr/branches/cloud> is
   ongoing. SOLR has lots of users, there is a book, there are companies
   supporting it, it has a large team <http://www.ohloh.net/p/solr>, and the
   Lucene and SOLR projects recently merged.
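The fixed-shard-count limitation noted in the ElasticSearch bullet comes down to hash routing: a document's shard is derived from a hash of its key modulo the shard count, so changing the count would re-home most existing documents. A small illustration of that effect (generic hashing, not ES's actual routing function):

```python
# Why shard count is fixed at index creation: documents are routed to a
# shard by hashing their key modulo the shard count. Changing the count
# re-homes most keys, so growth needs a new index and a full re-index.
import hashlib


def shard_for(key, num_shards):
    # Stable hash (Python's built-in hash() is salted per process).
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards


keys = ["row%d" % i for i in range(1000)]

# With 10 shards spread over, say, 2 nodes, whole shards can later be
# moved to new nodes without touching any document's placement.
placement_10 = {k: shard_for(k, 10) for k in keys}

# But growing 10 -> 12 shards would change most documents' shard:
placement_12 = {k: shard_for(k, 12) for k in keys}
moved = sum(1 for k in keys if placement_10[k] != placement_12[k])
print("documents re-homed by changing shard count:", moved, "of", len(keys))
```

This is why the post suggests over-sharding up front (e.g. 10 shards on one or two nodes): shard count bounds future node count without ever moving documents between shards.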


We don't expect Lily to be deployed on EC2 infrastructure, but more in
private cloud/datacenter settings with customers. However, should ES have
appeared sooner on our radar, there's a good chance we would have looked at
it more in-depth. SOLR cloud uses ZooKeeper which we'll need between Lily
clients and servers anyway, so that was a nice fit with our architecture.

I haven't compared the search language/features between SOLR and ES myself
though, it could be that you need specific stuff from ES (like: the EC2
focus).

Hope this helps,

Steven.
-- 
Steven Noels                            http://outerthought.org/
Outerthought                            Open Source Java & XML
stevenn at outerthought.org             Makers of the Daisy CMS

Re: elastic search or other Lucene for HBase?

Posted by Thomas Koch <th...@koch.ro>.
Hi Tim,

just this week I announced the porting of lucandra (Lucene on Cassandra) to 
HBase:
http://permalink.gmane.org/gmane.comp.java.hadoop.hbase.user/9118

While searching for the link to my mail I also found this, which I should look
into on Monday:
http://issues.apache.org/jira/browse/HBASE-270

Best regards, 

Thomas Koch


Tim Robertson:
> Hi all,
> 
> Is anyone using elastic search as an indexing layer to HBase content?
> It looks to have a really nice API, and was thinking of setting up an
> EC2 test where I maintain an ES index storing only the Key to HBase
> rows.  So ES provides all search returning Key lists and all single
> record Get being served from HBase.
> 
> Or is there a preferred distributed Lucene approach for HBase from the
> few that have been popping up?  I have not had a chance to really dig
> into the options but I know there has been a lot of chatter on this.
> 
> If no one has tried ES, I'll post some test results with MR based building.
> 
> Cheers,
> Tim
> 

Thomas Koch, http://www.koch.ro

Re: elastic search or other Lucene for HBase?

Posted by Tim Robertson <ti...@gmail.com>.
Hi Otis,

Other than some basic tests on EC2, I'm afraid not.  I was initially
pondering using ElasticSearch as the front-end REST layer and hooking in the
CRUD to HBase underneath, but since HBase is not my primary store (in
truth it is something I only get time to fire up and play with
occasionally at the moment) I normally do a MR load to HBase so would
go in under the ES layer.  For our use cases, custom search indexes
holding 'lite' record content in a cluster separate from HBase are
an attractive option, leaving HBase free to handle reporting, full
record detail serving, annotating etc., and also allowing different
scaling based on the ultimate load.

From my playing around with ES, it seems promising to act as the front
end layer - pretty good docs, nice API.  I have not hammered it with
any load though.

Sorry I can't be of more help,
Tim






On Thu, Jun 3, 2010 at 7:15 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Tim, ever done anything with HBase and ES?
> I'm interested in both together and apart .... http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/
>
>  Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
>
> ----- Original Message ----
>> From: Tim Robertson <ti...@gmail.com>
>> To: hbase-user@hadoop.apache.org
>> Sent: Sat, March 27, 2010 2:46:00 PM
>> Subject: elastic search or other Lucene for HBase?
>>
>> Hi all,
>>
>> Is anyone using elastic search as an indexing layer to HBase content?
>> It looks to have a really nice API, and was thinking of setting up an
>> EC2 test where I maintain an ES index storing only the Key to HBase
>> rows.  So ES provides all search returning Key lists and all single
>> record Get being served from HBase.
>>
>> Or is there a preferred distributed Lucene approach for HBase from the
>> few that have been popping up?  I have not had a chance to really dig
>> into the options but I know there has been a lot of chatter on this.
>>
>> If no one has tried ES, I'll post some test results with MR based building.
>>
>> Cheers,
>> Tim

Re: elastic search or other Lucene for HBase?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Tim, ever done anything with HBase and ES?
I'm interested in both together and apart .... http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/

 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
> From: Tim Robertson <ti...@gmail.com>
> To: hbase-user@hadoop.apache.org
> Sent: Sat, March 27, 2010 2:46:00 PM
> Subject: elastic search or other Lucene for HBase?
>
> Hi all,
>
> Is anyone using elastic search as an indexing layer to HBase content?
> It looks to have a really nice API, and was thinking of setting up an
> EC2 test where I maintain an ES index storing only the Key to HBase
> rows.  So ES provides all search returning Key lists and all single
> record Get being served from HBase.
>
> Or is there a preferred distributed Lucene approach for HBase from the
> few that have been popping up?  I have not had a chance to really dig
> into the options but I know there has been a lot of chatter on this.
>
> If no one has tried ES, I'll post some test results with MR based building.
>
> Cheers,
> Tim