Posted to user@hbase.apache.org by Mark Kerzner <ma...@gmail.com> on 2011/06/04 02:57:12 UTC

What's the best approach to search in HBase?

Hi,

I need to store, say, 10M-100M documents, with each document having say 100
fields, like author, creation date, access date, etc., and then I want to
ask questions like

give me all documents whose author is like abc**, and creation date any time
in 2010 and access date in 2010-2011, and so on, perhaps 10-20 conditions,
matching a list of some keywords.

What's best, Lucene, Katta, HBase CF with secondary indices, or plain scan
and compare of every record?

Thanks a bunch!

Mark
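
As a point of reference, the "plain scan and compare" option can be sketched
as follows. This is only an illustrative baseline in Python, not HBase client
code: the Document fields and the matches/scan helpers are hypothetical names,
and it assumes that every keyword in the list must appear in the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    # Hypothetical record layout; the real documents would have ~100 fields.
    author: str
    created: date
    accessed: date
    text: str


def matches(doc, author_prefix, created_year, accessed_years, keywords):
    """Return True if the document satisfies every condition."""
    first_year, last_year = accessed_years
    text = doc.text.lower()
    return (
        doc.author.startswith(author_prefix)
        and doc.created.year == created_year
        and first_year <= doc.accessed.year <= last_year
        # Assumption: all keywords must be present (not just any one of them).
        and all(k.lower() in text for k in keywords)
    )


def scan(docs, **conditions):
    # Full scan: every record is compared against every condition, so the
    # cost grows linearly with the number of documents.
    return [d for d in docs if matches(d, **conditions)]
```

With 10M-100M documents, each such query touches every record, which is the
main reason the index-based options in the question are worth comparing.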

Re: What's the best approach to search in HBase?

Posted by Joey Echeverria <jo...@cloudera.com>.
I can't claim it's the best, but I'd say Solr or Katta.

-Joey
On Jun 3, 2011 8:57 PM, "Mark Kerzner" <ma...@gmail.com> wrote:
> Hi,
>
> I need to store, say, 10M-100M documents, with each document having say 100
> fields, like author, creation date, access date, etc., and then I want to
> ask questions like
>
> give me all documents whose author is like abc**, and creation date any time
> in 2010 and access date in 2010-2011, and so on, perhaps 10-20 conditions,
> matching a list of some keywords.
>
> What's best, Lucene, Katta, HBase CF with secondary indices, or plain scan
> and compare of every record?
>
> Thanks a bunch!
>
> Mark

Re: What's the best approach to search in HBase?

Posted by Jason Rutherglen <ja...@gmail.com>.
I think this is the key line: "ElasticSearch is the search analogue to
HBase that frees us from some restrictions that Solr imposes".  That is
quite true; however, if search is inside of HBase, one gets the same
thing.  Solr does have serious limitations in terms of scaling, etc.  I
think ES has done a great job there, though this could have been done
with Solr just as easily, e.g., by upgrading Solr with the same
functionality and removing the need for schemas.  Solr does allow
schema-less use, e.g., with dynamic fields.

On Fri, Jun 3, 2011 at 8:36 PM, Matt Davies <ma...@tynt.com> wrote:
> I, for one, am interested in learning more about elasticsearch with HBase
> after reading the article over at StumbleUpon (
> http://www.stumbleupon.com/devblog/searching-for-serendipity/)
>
> Intriguing that it is relatively easy to set up. Anyone else using
> elasticsearch?
>
> -Matt

Re: What's the best approach to search in HBase?

Posted by Matt Davies <ma...@tynt.com>.
I, for one, am interested in learning more about elasticsearch with HBase
after reading the article over at StumbleUpon (
http://www.stumbleupon.com/devblog/searching-for-serendipity/)

Intriguing that it is relatively easy to set up. Anyone else using
elasticsearch?

-Matt

On Fri, Jun 3, 2011 at 8:23 PM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:

> Mark,
>
> 'Add search to HBase' - HBASE-3529 is in development.

Re: What's the best approach to search in HBase?

Posted by Jason Rutherglen <ja...@gmail.com>.
Mark,

'Add search to HBase' - HBASE-3529 is in development.

On Fri, Jun 3, 2011 at 5:57 PM, Mark Kerzner <ma...@gmail.com> wrote:
> Hi,
>
> I need to store, say, 10M-100M documents, with each document having say 100
> fields, like author, creation date, access date, etc., and then I want to
> ask questions like
>
> give me all documents whose author is like abc**, and creation date any time
> in 2010 and access date in 2010-2011, and so on, perhaps 10-20 conditions,
> matching a list of some keywords.
>
> What's best, Lucene, Katta, HBase CF with secondary indices, or plain scan
> and compare of every record?
>
> Thanks a bunch!
>
> Mark
>

Re: What's the best approach to search in HBase?

Posted by Jason Rutherglen <ja...@gmail.com>.
That doesn't look like it's open source though?  Isn't it a SaaS?

On Sat, Jun 4, 2011 at 11:15 AM, M. C. Srivas <mc...@gmail.com> wrote:
> There's also DrawnToScale http://www.drawntoscale.com/.
>
> Don't know if it's released or not.

Re: What's the best approach to search in HBase?

Posted by "M. C. Srivas" <mc...@gmail.com>.
There's also DrawnToScale http://www.drawntoscale.com/.

Don't know if it's released or not.

On Sat, Jun 4, 2011 at 11:07 AM, Steven Noels <st...@outerthought.org> wrote:

> On Sat, Jun 4, 2011 at 2:57 AM, Mark Kerzner <ma...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I need to store, say, 10M-100M documents, with each document having say 100
> > fields, like author, creation date, access date, etc., and then I want to
> > ask questions like
> >
> > give me all documents whose author is like abc**, and creation date any time
> > in 2010 and access date in 2010-2011, and so on, perhaps 10-20 conditions,
> > matching a list of some keywords.
> >
> > What's best, Lucene, Katta, HBase CF with secondary indices, or plain scan
> > and compare of every record?
> >
>
> I'd say give Lily a spin. Currently, we rely on Solr for search. In the next
> few months, we'll take a good look at "HBase-native" secondary indexes as
> well.
>
> Lily can be found at www.lilyproject.org.
>
> Thanks,
>
> Steven.
> --
> Steven Noels
> http://outerthought.org/
> Scalable Smart Data
> Makers of Kauri, Daisy CMS and Lily
>

Re: What's the best approach to search in HBase?

Posted by Steven Noels <st...@outerthought.org>.
On Sat, Jun 4, 2011 at 2:57 AM, Mark Kerzner <ma...@gmail.com> wrote:

> Hi,
>
> I need to store, say, 10M-100M documents, with each document having say 100
> fields, like author, creation date, access date, etc., and then I want to
> ask questions like
>
> give me all documents whose author is like abc**, and creation date any
> time
> in 2010 and access date in 2010-2011, and so on, perhaps 10-20 conditions,
> matching a list of some keywords.
>
> What's best, Lucene, Katta, HBase CF with secondary indices, or plain scan
> and compare of every record?
>

I'd say give Lily a spin. Currently, we rely on Solr for search. In the next
few months, we'll take a good look at "HBase-native" secondary indexes as
well.

Lily can be found at www.lilyproject.org.

Thanks,

Steven.
-- 
Steven Noels
http://outerthought.org/
Scalable Smart Data
Makers of Kauri, Daisy CMS and Lily

Re: What's the best approach to search in HBase?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
HBasene is dead.  Watch HBASE-3529.
Otis


>From: "Hiller, Dean  x66079" <de...@broadridge.com>
>To: "user@hbase.apache.org" <us...@hbase.apache.org>
>Sent: Friday, June 17, 2011 4:21 PM
>Subject: RE: What's the best approach to search in HBase?
>
>What about using Hbasene....is it pretty good....looks just like a distributed Lucene and the same api and everything?
>
>Later,
>Dean
>

RE: What's the best approach to search in HBase?

Posted by "Hiller, Dean x66079" <de...@broadridge.com>.
What about using HBasene? Is it any good? It looks just like a distributed Lucene, with the same API and everything.

Later,
Dean

-----Original Message-----
From: Mark Kerzner [mailto:markkerzner@gmail.com] 
Sent: Wednesday, June 15, 2011 10:10 PM
To: user@hbase.apache.org
Subject: Re: What's the best approach to search in HBase?

Thank you, everybody. I summarized your advice here,
http://shmsoft.blogspot.com/2011/06/search-in-ediscovery.html, because I
need it for my open source eDiscovery, and now just need to try it all :)

Sincerely,
Mark



Re: What's the best approach to search in HBase?

Posted by Mark Kerzner <ma...@gmail.com>.
Thank you, everybody. I summarized your advice here,
http://shmsoft.blogspot.com/2011/06/search-in-ediscovery.html, because I
need it for my open source eDiscovery, and now just need to try it all :)

Sincerely,
Mark

On Mon, Jun 6, 2011 at 11:18 AM, Buttler, David <bu...@llnl.gov> wrote:

> I store over 500M documents in HBase, and index using Solr with dynamic
> fields.  This gives you tremendous flexibility to do the type of queries you
> are looking for -- and to make them simple and intuitive via a faceted
> interface.
>
> However, there was quite a bit of software that we had to write to get
> things going, and I can neither release all of it open source, or support
> other people using it.  If I had to start again, I would seriously look at
> solutions like elastic search and lily.
>
> Dave

RE: What's the best approach to search in HBase?

Posted by "Buttler, David" <bu...@llnl.gov>.
I store over 500M documents in HBase, and index using Solr with dynamic fields.  This gives you tremendous flexibility to do the type of queries you are looking for -- and to make them simple and intuitive via a faceted interface.

However, there was quite a bit of software that we had to write to get things going, and I can neither release all of it as open source nor support other people using it.  If I had to start again, I would seriously look at solutions like ElasticSearch and Lily.

Dave
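
For illustration, the Solr approach with dynamic fields described above might
turn the query from the original question into request parameters roughly
like this. It is a sketch under assumptions, not the software mentioned in
this message: build_solr_params is a hypothetical helper, and the field names
author_s, created_dt and accessed_dt are made up, relying on the *_s and *_dt
dynamic-field patterns found in Solr's example schema.

```python
from urllib.parse import urlencode


def build_solr_params(author_prefix, created_year, accessed_years, keywords):
    """Build the parameter list for a Solr /select request."""
    first_year, last_year = accessed_years
    filters = [
        # Hypothetical field names; the suffixes select dynamic field types.
        f"author_s:{author_prefix}*",
        f"created_dt:[{created_year}-01-01T00:00:00Z TO {created_year}-12-31T23:59:59Z]",
        f"accessed_dt:[{first_year}-01-01T00:00:00Z TO {last_year}-12-31T23:59:59Z]",
    ]
    # Keywords go into the main query; fall back to match-all if none given.
    query = " AND ".join(keywords) if keywords else "*:*"
    params = [("q", query)]
    # Each structured condition becomes its own filter query (fq).
    params += [("fq", f) for f in filters]
    # Ask for facet counts on one field to drive a faceted interface.
    params += [("facet", "true"), ("facet.field", "author_s")]
    return params


if __name__ == "__main__":
    params = build_solr_params("abc", 2010, (2010, 2011), ["report", "hbase"])
    print(urlencode(params))
```

Keeping each condition in a separate fq parameter lets Solr cache the filters
independently, which suits queries that combine 10-20 conditions.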
