Posted to user@hbase.apache.org by "Sharma, Raghvendra" <sr...@corelogic.com> on 2010/09/27 06:27:31 UTC

Fast querying mechanism for hbase data ?

I am running a little test/poc here.

I need to load a few million rows into a database every day. This is not log file data: I have comma-delimited rows (of columns) that would fit a relational database exactly.

After loading, I need to offer a very fast search mechanism. Having looked a bit at Google's Bigtable implementation and the structure around it, I originally thought of using Hive integrated with HBase, Hive because of its querying capabilities. The loading works out fine, with better performance than the RDBMS. However, the querying bottleneck, which was the reason to look for alternatives to the RDBMS in the first place, persists with Hive too.

In my tests, Hive's query performance is not exactly blazing. Perhaps I need to look for alternatives.

Is there something else? Any other tool/solution/library that I can put on top of HBase, or even use without HBase? (I looked at HBase as an alternative to the RDBMS, moving towards distributed computing.)

Suggestions please...

--raghav..
******************************************************************************************
This message may contain confidential or proprietary information intended only for the use of the
addressee(s) named above or may contain information that is legally privileged. If you are
not the intended addressee, or the person responsible for delivering it to the intended addressee,
you are hereby notified that reading, disseminating, distributing or copying this message is strictly
prohibited. If you have received this message by mistake, please immediately notify us by
replying to the message and delete the original message and any copies immediately thereafter.

Thank you.
******************************************************************************************
CLLD

Re: Fast querying mechanism for hbase data ?

Posted by Steven Noels <st...@outerthought.org>.

On Mon, Sep 27, 2010 at 7:30 AM, Sharma, Raghvendra <
sraghvendra@corelogic.com> wrote:

> Thanks for the responses...
>
> But I want to emphasize that... we are dealing with a relatively broad base
> here... around 150 million rows...
> And quite a few columns...few hundred perhaps...  Therefore, I am a bit
> apprehensive with writing a fresh piece here, for which I would need to do
> whole lot of testing...
>
> Preferably, I am looking for an existing piece, for which some testing
> would have happened already... if there is one like that..
>

If it's store, index and search you're after, and your ingestion rates are
in the normal range, check out Lily trunk. We'll have a release end of
October finishing up on full distribution, but trunk already supports SOLR
sharding. The Mozilla datawarehouse people have been playing with it already
as well.

www.lilyproject.org == HBase+SOLR.
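
To illustrate why pairing a store with an index helps here, below is a toy sketch in plain Python. It is not Lily or Solr code: the function names `build_index` and `lookup`, the row keys and the sample columns are all made up for the example. The idea it shows is that the index maps a column value to the row keys that hold it, so a search becomes a few keyed reads instead of a pass over every row.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each value of `column` to the set of row keys that hold it."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        if column in columns:
            index[columns[column]].add(row_key)
    return index


def lookup(rows, index, value):
    """Fetch matching rows by key via the index instead of scanning all rows."""
    return {key: rows[key] for key in sorted(index.get(value, ()))}


# Hypothetical sample data standing in for rows kept in the store.
rows = {
    "row-001": {"city": "Irvine", "zip": "92618"},
    "row-002": {"city": "Austin", "zip": "73301"},
    "row-003": {"city": "Irvine", "zip": "92602"},
}

city_index = build_index(rows, "city")
matches = lookup(rows, city_index, "Irvine")
```

In a real deployment the index would live in the search engine and the rows in HBase; the sketch only shows the division of labour between the two.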

Cheers,

Steven.
-- 
Steven Noels
http://outerthought.org/
Open Source Content Applications
Makers of Kauri, Daisy CMS and Lily

RE: Fast querying mechanism for hbase data ?

Posted by "Sharma, Raghvendra" <sr...@corelogic.com>.

Thanks for the responses.
But I want to emphasize that we are dealing with a relatively broad base here: around 150 million rows, and quite a few columns, a few hundred perhaps. Therefore, I am a bit apprehensive about writing a fresh piece here, for which I would need to do a whole lot of testing.

Preferably, I am looking for an existing piece that has already seen some testing, if there is one like that.

@Jack - when you wrote your application in C++, which API did you use? The standard HBase one? How was the experience? Can you please share a bit more?

Regards
Raghav..

-----Original Message-----
From: Jack Levin [mailto:magnito@gmail.com] 
Sent: Monday, September 27, 2010 10:52 AM
To: user@hbase.apache.org
Subject: Re: Fast querying mechanism for hbase data ?

You could just write an application in any language that would query
your rows, put them in memory, then do any sort of sorting or
processing.  Use REST api, and you are done.

We did an experiment of sorting/querying by using C++, and it was
quite impressive with 10k rows.

-Jack


Re: Fast querying mechanism for hbase data ?

Posted by Jack Levin <ma...@gmail.com>.

You could just write an application in any language that would query
your rows, put them in memory, then do any sort of sorting or
processing.  Use REST api, and you are done.

We did an experiment of sorting/querying by using C++, and it was
quite impressive with 10k rows.

-Jack
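
A minimal sketch of that approach in Python, under some assumptions: it assumes the HBase REST gateway returns JSON shaped like `{"Row": [{"key": ..., "Cell": [{"column": ..., "$": ...}]}]}` with base64-encoded keys, column names and values, which should be checked against the gateway version in use. The function names, the `base_url` parameter and the `row_spec` parameter (a row key, or whatever row specifier your gateway accepts) are hypothetical.

```python
import base64
import json
import urllib.request


def fetch_rows(base_url, table, row_spec):
    """GET rows from the REST gateway as JSON (needs a running gateway)."""
    request = urllib.request.Request(
        f"{base_url}/{table}/{row_spec}",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def _b64(text):
    """Decode one base64 field of the (assumed) gateway JSON format."""
    return base64.b64decode(text).decode("utf-8")


def decode_rows(payload):
    """Turn the gateway payload into plain dicts held in memory."""
    rows = []
    for row in payload.get("Row", []):
        record = {"key": _b64(row["key"])}
        for cell in row.get("Cell", []):
            record[_b64(cell["column"])] = _b64(cell["$"])
        rows.append(record)
    return rows


def sort_rows(rows, column):
    """Sort in memory by one column; rows lacking the column sort first."""
    return sorted(rows, key=lambda record: record.get(column, ""))
```

Usage would look like `sort_rows(decode_rows(fetch_rows(url, "mytable", "row-001")), "d:city")`, where the table and column names are placeholders. Everything is held in client memory, so this suits result sets of the size mentioned above rather than the whole table.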

On Sun, Sep 26, 2010 at 9:57 PM, Imran M Yousuf <im...@gmail.com> wrote:
> Hi Raghav,
>
> You could try Apache Solr along with HBase. Apache Solr is designed
> for Full Text search and works in various modes in terms of storing
> indexes.
> http://lucene.apache.org/solr/
> http://github.com/akkumar/hbasene [Provides a distributed system to
> use HBase as the backing store for the TF-IDF representation, as
> needed by Lucene]
> http://www.lilyproject.org/lily/index.html [Cloud-scalable NoSQL-based
> content store and search repository, built on top of Apache HBase and
> SOLR]
>
> If your requirement is not real-time in nature you may also try the
> Scanner API of HBase Client.
> http://hbase.apache.org/docs/r0.89.20100726/apidocs/index.html
>
> Regards,
>
> Imran
>

Re: Fast querying mechanism for hbase data ?

Posted by Imran M Yousuf <im...@gmail.com>.

Hi Raghav,

You could try Apache Solr along with HBase. Apache Solr is designed
for Full Text search and works in various modes in terms of storing
indexes.
http://lucene.apache.org/solr/
http://github.com/akkumar/hbasene [Provides a distributed system to
use HBase as the backing store for the TF-IDF representation, as
needed by Lucene]
http://www.lilyproject.org/lily/index.html [Cloud-scalable NoSQL-based
content store and search repository, built on top of Apache HBase and
SOLR]

If your requirement is not real-time in nature you may also try the
Scanner API of HBase Client.
http://hbase.apache.org/docs/r0.89.20100726/apidocs/index.html
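
As a rough picture of what a scan does, here is a toy in-memory model in Python. It is not the HBase client API: the `SortedTable` class and the date-prefixed row keys are invented for the example. It relies on two properties of HBase scans: rows are kept sorted by row key, and a scan covers the range from a start row (inclusive) to a stop row (exclusive). HBase compares keys as bytes; for the ASCII keys used here, Python string ordering gives the same result.

```python
from bisect import bisect_left


class SortedTable:
    """Toy stand-in for a table whose rows are ordered by row key."""

    def __init__(self, rows):
        self._keys = sorted(rows)
        self._rows = dict(rows)

    def scan(self, start_row="", stop_row=None):
        """Yield (key, columns) for start_row <= key < stop_row."""
        lo = bisect_left(self._keys, start_row)
        if stop_row is None:
            hi = len(self._keys)
        else:
            hi = bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


# Hypothetical rows keyed by load date plus a sequence number.
table = SortedTable({
    "20100926|0001": {"d:city": "Austin"},
    "20100927|0001": {"d:city": "Irvine"},
    "20100927|0002": {"d:city": "Dhaka"},
    "20100928|0001": {"d:city": "Ghent"},
})

# Only the rows loaded on 2010-09-27 are visited.
one_day = list(table.scan("20100927|", "20100928|"))
```

The point of the sketch is that a scan is cheap when the query can be phrased as a row-key range, and turns into a pass over the whole table when it filters on arbitrary columns.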

Regards,

Imran

-- 
Imran M Yousuf
Entrepreneur & CEO
Smart IT Engineering Ltd.
Dhaka, Bangladesh
Twitter: @imyousuf - http://twitter.com/imyousuf
Blog: http://imyousuf-tech.blogs.smartitengineering.com/
Mobile: +880-1711402557