Posted to user@hbase.apache.org by TuX RaceR <tu...@gmail.com> on 2010/03/11 11:07:21 UTC

random access and hotspots

Hello List,

I'll be accessing a table mainly via random access and I am looking for
an efficient way of randomizing the keys.
I thought about an MD5 hash of the ID of the record, but as MD5 returns a
string of chars [0-9A-F] I was wondering if there was a better method to
use.
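
To make it concrete, here is a rough sketch of what I have in mind (assuming
the plain HBase Java client; the class and method names are just illustrative):

import java.security.MessageDigest;

// Rough sketch (mine): derive a randomized row key from a record ID using the
// raw MD5 digest bytes directly, rather than the 32-char hex [0-9A-F] string.
// HBase row keys are plain byte arrays, so no hex encoding is needed.
public class KeyUtil {
    public static byte[] randomizedKey(String recordId) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return md5.digest(recordId.getBytes("UTF-8")); // 16 raw bytes
    }
}

Using the raw 16 digest bytes instead of the 32-char hex string would also
halve the key size, if I understand correctly.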

Thanks
TuX

Re: random access and hotspots

Posted by TuX RaceR <tu...@gmail.com>.
Thanks Alex, for your help.
Cheers
TuX

Alex Baranov wrote:
> _If your only use-case is searching_ then you might think about Solr. It
> can handle "a few tens of millions" of documents gracefully. Solr also has
> solutions for splitting the index and replicating it (for load balancing).
> Another thing I'd suggest considering is Lucandra (
> http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
> ).
> There are also some movements around creating a Lucene index in HBase (
> http://www.search-hadoop.com/m?id=201003091546.40151.thomas@koch.ro).
>
> Check the http://wiki.apache.org/hadoop/DistributedLucene page for similar
> solutions.
>
> Alex Baranau
>
> http://sematext.com
> http://en.wordpress.com/tag/hadoop-ecosystem-digest/
>
> On Thu, Mar 11, 2010 at 11:40 PM, TuX RaceR <tu...@gmail.com> wrote:
>
>   
>> Hi Alex
>>
>> Thanks again for your detailed answer.
>>
>>
>> Alex Baranov wrote:
>>
>>> So, 2 to 50 columns in each row. If the single row size (in bytes) is
>>> not large and the request load (number of concurrent clients performing
>>> the described queries) is heavy, then you should probably consider simple
>>> data duplication. I.e. rows with composite keys (which you've put in the
>>> "Indexes" table) will contain all the data you need.
>>>
>> Yes, that is what I thought too after reading the typical read performance
>> at:
>>
>> http://www.search-hadoop.com/m?id=7c962aed1001141446v467a295ctd86f0e8a3ef77596@mail.gmail.com
>> Having to do 100 random accesses to generate just one web page would be too
>> costly.
>> Especially since the list pages pointing to the documents do not need to show
>> the whole document content (just the information necessary to generate a
>> link and maybe a short summary).
>>
>>
>>
>>> Given that the total count of all
>>> rows would be 1-10 billion, this might work well for you. Of course, this
>>> would only work if your data doesn't change over time (is immutable).
>>>
>> Yes, the data may change over time, although not very often. This is the
>> biggest headache I have when designing a solution using HBase ;): updating
>> indexes.
>>
>>
>>> Also, have you considered IHBase and related secondary index
>>> implementations: the one from the transactional contrib, lucene-hbase?
>>>
>> I have already looked closely at Solr and a bit at ElasticSearch.
>> I was interested in HBase because of its scaling capabilities.
>> My current system is beginning to show the limits of PostgreSQL. I could
>> move the slow requests based on a SQL index to a Lucene-based index, and
>> then move to HBase when the site gets bigger. Or I could invest the time now
>> in an HBase solution and skip the intermediate (Lucene-based) stage.
>> That's not decided yet.
>>
>> I do not understand HBase secondary indexes very well, nor their advantages
>> over hand-made indexes. Does using HBase secondary indexes help when the
>> data is mutable? Or is it because indexes are created in a transaction,
>> making data consistency stronger?
>>
>> Thanks
>> TuX
>>
>>     
>
>   


Re: random access and hotspots

Posted by Alex Baranov <al...@gmail.com>.
_If your only use-case is searching_ then you might think about Solr. It
can handle "a few tens of millions" of documents gracefully. Solr also has
solutions for splitting the index and replicating it (for load balancing).
Another thing I'd suggest considering is Lucandra (
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
).
There are also some movements around creating a Lucene index in HBase (
http://www.search-hadoop.com/m?id=201003091546.40151.thomas@koch.ro).

Check the http://wiki.apache.org/hadoop/DistributedLucene page for similar
solutions.

Alex Baranau

http://sematext.com
http://en.wordpress.com/tag/hadoop-ecosystem-digest/

On Thu, Mar 11, 2010 at 11:40 PM, TuX RaceR <tu...@gmail.com> wrote:

> Hi Alex
>
> Thanks again for your detailed answer.
>
>
> Alex Baranov wrote:
>
>> So, 2 to 50 columns in each row. If the single row size (in bytes) is
>> not large and the request load (number of concurrent clients performing
>> the described queries) is heavy, then you should probably consider simple
>> data duplication. I.e. rows with composite keys (which you've put in the
>> "Indexes" table) will contain all the data you need.
>>
>
> Yes, that is what I thought too after reading the typical read performance
> at:
>
> http://www.search-hadoop.com/m?id=7c962aed1001141446v467a295ctd86f0e8a3ef77596@mail.gmail.com
> Having to do 100 random accesses to generate just one web page would be too
> costly.
> Especially since the list pages pointing to the documents do not need to show
> the whole document content (just the information necessary to generate a
> link and maybe a short summary).
>
>
>
>> Given that the total count of all
>> rows would be 1-10 billion, this might work well for you. Of course, this
>> would only work if your data doesn't change over time (is immutable).
>>
> Yes, the data may change over time, although not very often. This is the
> biggest headache I have when designing a solution using HBase ;): updating
> indexes.
>
>
>> Also, have you considered IHBase and related secondary index
>> implementations: the one from the transactional contrib, lucene-hbase?
>>
>>
> I have already looked closely at Solr and a bit at ElasticSearch.
> I was interested in HBase because of its scaling capabilities.
> My current system is beginning to show the limits of PostgreSQL. I could
> move the slow requests based on a SQL index to a Lucene-based index, and
> then move to HBase when the site gets bigger. Or I could invest the time now
> in an HBase solution and skip the intermediate (Lucene-based) stage.
> That's not decided yet.
>
> I do not understand HBase secondary indexes very well, nor their advantages
> over hand-made indexes. Does using HBase secondary indexes help when the
> data is mutable? Or is it because indexes are created in a transaction,
> making data consistency stronger?
>
> Thanks
> TuX
>

Re: random access and hotspots

Posted by TuX RaceR <tu...@gmail.com>.
Hi Alex

Thanks again for your detailed answer.

Alex Baranov wrote:

> So, 2 to 50 columns in each row. If the single row size (in
> bytes) is not large and the request load (number of concurrent
> clients performing the described queries) is heavy, then you should
> probably consider simple data duplication. I.e. rows with composite keys
> (which you've put in the "Indexes" table) will contain all the data you need.

Yes, that is what I thought too after reading the typical read 
performance at:
http://www.search-hadoop.com/m?id=7c962aed1001141446v467a295ctd86f0e8a3ef77596@mail.gmail.com
Having to do 100 random accesses to generate just one web page would be
too costly.
Especially since the list pages pointing to the documents do not need to
show the whole document content (just the information necessary to
generate a link and maybe a short summary).


> Given that the total count of all
> rows would be 1-10 billion, this might work well for you. Of course, this
> would only work if your data doesn't change over time (is immutable).
Yes, the data may change over time, although not very often. This is the
biggest headache I have when designing a solution using HBase ;):
updating indexes.

> Also, have you considered IHBase and related secondary index
> implementations: the one from the transactional contrib, lucene-hbase?
>   
I have already looked closely at Solr and a bit at ElasticSearch.
I was interested in HBase because of its scaling capabilities.
My current system is beginning to show the limits of PostgreSQL. I could
move the slow requests based on a SQL index to a Lucene-based index, and
then move to HBase when the site gets bigger. Or I could invest the time
now in an HBase solution and skip the intermediate (Lucene-based) stage.
That's not decided yet.

I do not understand HBase secondary indexes very well, nor their
advantages over hand-made indexes. Does using HBase secondary
indexes help when the data is mutable? Or is it because indexes are
created in a transaction, making data consistency stronger?

Thanks
TuX

Re: random access and hotspots

Posted by Alex Baranov <al...@gmail.com>.
>
> How many columns would the Random table have?
>
> a few tens of millions (10^7)

> What is the row size?
>
> Rows will contain from two to 50 columns

You probably meant that "a few tens of millions (10^7)" is a row count.

So, 2 to 50 columns in each row. If the single row size (in bytes) is not
large and the request load (number of concurrent clients performing the
described queries) is heavy, then you should probably consider simple data
duplication. I.e. rows with composite keys (which you've put in the "Indexes"
table) will contain all the data you need. Given that the total count of all
rows would be 1-10 billion, this might work well for you. Of course, this
would only work if your data doesn't change over time (is immutable). If you
have just a few clients then, keeping in mind that
> 100 is probably an upper bound
random queries should work fast enough.
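
A minimal sketch of such duplication, assuming the 0.20-era Java client (the
column family "d" and the qualifiers below are placeholders of mine, not a
prescribed schema):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: duplicate the fields a list page needs into the index row itself,
// so one sequential scan over "Indexes" serves a page without N random gets.
public class DuplicateIntoIndex {
    public static void main(String[] args) throws IOException {
        HTable indexes = new HTable(new HBaseConfiguration(), "Indexes");
        Put put = new Put(Bytes.toBytes("date_user1_20100101"));  // composite key
        put.add(Bytes.toBytes("d"), Bytes.toBytes("id"), Bytes.toBytes("doc_1"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("Title"), Bytes.toBytes("some title1"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("Summary"), Bytes.toBytes("a short summary"));
        indexes.put(put);
    }
}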

Also, have you considered IHBase and related secondary index
implementations: the one from the transactional contrib, lucene-hbase?

Alex Baranau

sematext.com
http://en.wordpress.com/tag/hadoop-ecosystem-digest/

On Thu, Mar 11, 2010 at 6:01 PM, TuX RaceR <tu...@gmail.com> wrote:

> Hello Alex,
>
> Thank you for your mail.
>
>
> Alex Baranov wrote:
>
>> How many columns would the Random table have?
>>
> a few tens of millions (10^7)
>
>
>> What is the row size?
>>
> Rows will contain from two to 50 columns
>
>> How many
>> rows are you going to fetch at one time (I assume just for displaying one
>> page with 10, 20, 100 records)?
>>
>
> Yes, that's correct: 100 is probably an upper bound.
>
>
>> How big is your data (estimated row count)?
>>
>
> a few tens of millions for Random, 100 to 1000 times more for Indexes
>
>> How many different types of "indexes" are you planning to have?
>>
>
> around 50 Indexes
>
>>> ...I need to fork concurrent threads/processes to get the document
>>> details...
>>
>> Yes, increasing the number of threads/processes will increase performance.
>>
>>
>>> how many random searches HBase will stand
>>>
>> This depends on your hardware. Have a look in the MLs for some details
>> shared by others, like:
>>
>> http://www.search-hadoop.com/m?id=7c962aed1001141446v467a295ctd86f0e8a3ef77596@mail.gmail.com
>>
>>
>>
>
>
> Thanks for the links
> cheers
> TuX
>
>
>
>>> one new socket (is that true?) is created at each random access request
>>>
>> No, but each request does (obviously) cause a new IPC call (one can use
>> Scan.get(int count) to fetch more rows in a single call).
>>
>> Alex Baranau
>>
>>
>
>

Re: random access and hotspots

Posted by TuX RaceR <tu...@gmail.com>.
Hello Alex,

Thank you for your mail.

Alex Baranov wrote:
> How many columns would the Random table have?
a few tens of millions (10^7)

> What is the row size? 
Rows will contain from two to 50 columns
> How many
> rows are you going to fetch at one time (I assume just for displaying one
> page with 10, 20, 100 records)?
>   
Yes, that's correct: 100 is probably an upper bound.

> How big is your data (estimated row count)?
>
a few tens of millions for Random, 100 to 1000 times more for Indexes
> How many different types of "indexes" are you planning to have?
>
>   
around 50 Indexes
>> ...I need to fork concurrent threads/processes to get the document
>>     
> details...
>
> Yes, increasing the number of threads/processes will increase performance.
>
>> how many random searches HBase will stand
>>
> This depends on your hardware. Have a look in the MLs for some details
> shared by others, like:
> http://www.search-hadoop.com/m?id=7c962aed1001141446v467a295ctd86f0e8a3ef77596@mail.gmail.com
>
>   


Thanks for the links
cheers
TuX


>> one new socket (is that true?) is created at each random access request
>>     
> No, but each request does (obviously) cause a new IPC call (one can use
> Scan.get(int count) to fetch more rows in a single call).
>
> Alex Baranau
>   


Re: random access and hotspots

Posted by Alex Baranov <al...@gmail.com>.
How many columns would the Random table have? What is the row size? How many
rows are you going to fetch at one time (I assume just for displaying one
page with 10, 20, 100 records)?
How big is your data (estimated row count)?
How many different types of "indexes" are you planning to have?


> ...I need to fork concurrent threads/processes to get the document
> details...

Yes, increasing the number of threads/processes will increase performance.
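
For illustration, a rough sketch of parallel random gets with the Java client
(pool size and names are arbitrary choices of mine; note that HTable is not
thread-safe, so each task opens its own instance here, though in practice you
would reuse one per worker thread):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: fetch the document details for one page of ids concurrently.
public class ParallelGets {
    public static List<Result> fetch(List<String> ids) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Result>> futures = new ArrayList<Future<Result>>();
        for (final String id : ids) {
            futures.add(pool.submit(new Callable<Result>() {
                public Result call() throws Exception {
                    // one HTable per task; HTable instances are not thread-safe
                    HTable table = new HTable(new HBaseConfiguration(), "Random");
                    return table.get(new Get(Bytes.toBytes(id)));
                }
            }));
        }
        List<Result> results = new ArrayList<Result>();
        for (Future<Result> f : futures) results.add(f.get());
        pool.shutdown();
        return results;
    }
}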

> how many random searches HBase will stand
This depends on your hardware. Have a look in the MLs for some details shared
by others, like:
http://www.search-hadoop.com/m?id=7c962aed1001141446v467a295ctd86f0e8a3ef77596@mail.gmail.com

> one new socket (is that true?) is created at each random access request
No, but each request does (obviously) cause a new IPC call (one can use
Scan.get(int count) to fetch more rows in a single call).
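
Presumably this means the scanner API; a minimal sketch of batched fetching
assuming the 0.20-era Java client, where Scan.setCaching() and
ResultScanner.next(int) fetch multiple rows per round trip (the key range and
sizes below are illustrative):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: fetch a whole page of index rows in a few round trips instead of
// one IPC call per row.
public class PageScan {
    public static void main(String[] args) throws IOException {
        HTable indexes = new HTable(new HBaseConfiguration(), "Indexes");
        Scan scan = new Scan(Bytes.toBytes("date_user1_"),
                             Bytes.toBytes("date_user1_~"));  // illustrative stop key
        scan.setCaching(100);  // rows transferred per round trip
        ResultScanner scanner = indexes.getScanner(scan);
        try {
            Result[] page = scanner.next(100);  // up to one page in a single call
            for (Result row : page) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        } finally {
            scanner.close();
        }
    }
}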

Alex Baranau

sematext.com
http://en.wordpress.com/tag/hadoop-ecosystem-digest/

On Thu, Mar 11, 2010 at 3:37 PM, TuX RaceR <tu...@gmail.com> wrote:

> Thanks Alex for your answer.
>
> I am not yet at a stage where I can measure the performance (I am still at
> the db design stage, initial population) but my understanding was that
> randomizing the keys is a way of avoiding key hotspots.
> To simplify, let's assume I have documents attached to users that I need
> to search by date.
> I have two tables: one "Random" optimized for random access and one
> "Indexes" optimized for sequential access scanners.
>
> 'Random' stores document details:
> Random:
> doc_1-> Title:"some title1",Text:"some longer
> text1",user:1,CreateDate:2010-01-01
> doc_2-> Title:"some title2",Text:"some longer
> text2",user:1,CreateDate:2010-01-02
> ....
>
> 'Indexes' stores document indexes (for instance here is an index on date
> and date+user):
> date_20100101:id:1
> date_20100102:id:2
> ...
> date_user1_20100101:id:1
> date_user1_20100102:id:2
>
>
> As a user typically adds many documents in a short period of time, it is
> common for the documents obtained by the scanner to also be contiguous in
> the Random table (without randomization).
> So, once I get the IDs of the documents from the scanner query, I need to
> fork concurrent threads/processes to get the document details: that (from
> what I understand) would create a key hotspot in the 'Random' table.
> Is my reasoning above correct? My feeling is that a typical HBase
> application does both scanner and random access patterns alternately.
>
> Another question I have until I test this is how many random searches HBase
> will stand. The scanner will present links to the documents (paging
> implementation), so I am not sure what a realistic value of documents per
> page could be: 10, 20 or 100? As (at least) one new socket (is that true?)
> is created at each random access request, I am afraid such a design could
> bring the HBase layer down (until maybe
> http://issues.apache.org/jira/browse/HBASE-1845 is fixed)
>
> Thanks
> TuX
>
>
>
>
> Alex Baranov wrote:
>
>> Hello Tux,
>>
>> Accessing a table in a "random access" manner is not in itself a reason for
>> randomizing keys. You will likely need to randomize your keys only to get
>> better performance when importing an existing large dataset into HBase.
>> Otherwise, unless your insertion rate is bigger than 20K records/sec, I
>> wouldn't suggest thinking about this issue. It would be great if you could
>> tell us more about your use-case.
>>
>> MD5, SHA-1 or Jenkins Hash (in org.apache.hadoop.hbase.util.JenkinsHash)
>> are
>> all mechanisms you might consider.
>>
>> Alex Baranau
>>
>> sematext.com
>> http://en.wordpress.com/tag/hadoop-ecosystem-digest/
>>
>> On Thu, Mar 11, 2010 at 12:07 PM, TuX RaceR <tu...@gmail.com> wrote:
>>
>>
>>
>>> Hello List,
>>>
>>> I'll be accessing a table mainly via random access and I am looking for an
>>> efficient way of randomizing the keys.
>>> I thought about an MD5 hash of the ID of the record, but as MD5 returns a
>>> string of chars [0-9A-F] I was wondering if there was a better method to
>>> use.
>>>
>>> Thanks
>>> TuX
>>>
>>>
>>>
>>
>>
>>
>
>

Re: random access and hotspots

Posted by TuX RaceR <tu...@gmail.com>.
Thanks Alex for your answer.

I am not yet at a stage where I can measure the performance (I am still
at the db design stage, initial population) but my understanding was
that randomizing the keys is a way of avoiding key hotspots.
To simplify, let's assume I have documents attached to users that I
need to search by date.
I have two tables: one "Random" optimized for random access and one
"Indexes" optimized for sequential access scanners.

'Random' stores document details:
Random:
doc_1-> Title:"some title1",Text:"some longer 
text1",user:1,CreateDate:2010-01-01
doc_2-> Title:"some title2",Text:"some longer 
text2",user:1,CreateDate:2010-01-02
....

'Indexes' stores document indexes (for instance here is an index on date 
and date+user):
date_20100101:id:1
date_20100102:id:2
...
date_user1_20100101:id:1
date_user1_20100102:id:2
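
To make the key layout concrete, a rough sketch of how I picture building
these keys (the separators and the fixed-width yyyyMMdd date format are my
own assumptions):

// Rough sketch of composite index key construction for the "Indexes" table.
// Assumed layouts: "date_<yyyyMMdd>" and "date_user<N>_<yyyyMMdd>".
public class IndexKeys {
    public static String dateKey(String yyyyMMdd) {
        return "date_" + yyyyMMdd;                    // e.g. date_20100101
    }
    public static String dateUserKey(long userId, String yyyyMMdd) {
        return "date_user" + userId + "_" + yyyyMMdd; // e.g. date_user1_20100101
    }
}

With fixed-width dates, lexicographic key order matches chronological order,
so a scanner over the prefix date_user1_ returns that user's documents by date.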


As a user typically adds many documents in a short period of time, it is
common for the documents obtained by the scanner to also be contiguous
in the Random table (without randomization).
So, once I get the IDs of the documents from the scanner query, I need
to fork concurrent threads/processes to get the document details: that
(from what I understand) would create a key hotspot in the 'Random' table.
Is my reasoning above correct? My feeling is that a typical HBase
application does both scanner and random access patterns alternately.

Another question I have until I test this is how many random searches
HBase will stand. The scanner will present links to the documents
(paging implementation), so I am not sure what a realistic value of
documents per page could be: 10, 20 or 100? As (at least) one new socket
(is that true?) is created at each random access request, I am afraid
such a design could bring the HBase layer down (until maybe
http://issues.apache.org/jira/browse/HBASE-1845 is fixed)

Thanks
TuX



Alex Baranov wrote:
> Hello Tux,
>
> Accessing a table in a "random access" manner is not in itself a reason for
> randomizing keys. You will likely need to randomize your keys only to get
> better performance when importing an existing large dataset into HBase.
> Otherwise, unless your insertion rate is bigger than 20K records/sec, I
> wouldn't suggest thinking about this issue. It would be great if you could
> tell us more about your use-case.
>
> MD5, SHA-1 or Jenkins Hash (in org.apache.hadoop.hbase.util.JenkinsHash) are
> all mechanisms you might consider.
>
> Alex Baranau
>
> sematext.com
> http://en.wordpress.com/tag/hadoop-ecosystem-digest/
>
> On Thu, Mar 11, 2010 at 12:07 PM, TuX RaceR <tu...@gmail.com> wrote:
>
>   
>> Hello List,
>>
>> I'll be accessing a table mainly via random access and I am looking for an
>> efficient way of randomizing the keys.
>> I thought about an MD5 hash of the ID of the record, but as MD5 returns a
>> string of chars [0-9A-F] I was wondering if there was a better method to
>> use.
>>
>> Thanks
>> TuX
>>
>>     
>
>   


Re: random access and hotspots

Posted by Alex Baranov <al...@gmail.com>.
Hello Tux,

Accessing a table in a "random access" manner is not in itself a reason for
randomizing keys. You will likely need to randomize your keys only to get
better performance when importing an existing large dataset into HBase.
Otherwise, unless your insertion rate is bigger than 20K records/sec, I
wouldn't suggest thinking about this issue. It would be great if you could
tell us more about your use-case.

MD5, SHA-1 or Jenkins Hash (in org.apache.hadoop.hbase.util.JenkinsHash) are
all mechanisms you might consider.
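
For example, a minimal sketch of salting a key with JenkinsHash (keeping the
original id as a suffix so the key stays recoverable; the 4-byte prefix is an
arbitrary choice of mine):

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.JenkinsHash;

// Sketch: spread sequential ids across the key space by prefixing each key
// with a hash of itself. The original id survives as a readable suffix.
public class SaltedKey {
    public static byte[] salted(String recordId) {
        byte[] id = Bytes.toBytes(recordId);
        int hash = JenkinsHash.getInstance().hash(id);
        return Bytes.add(Bytes.toBytes(hash), id);  // [4-byte hash][original id]
    }
}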

Alex Baranau

sematext.com
http://en.wordpress.com/tag/hadoop-ecosystem-digest/

On Thu, Mar 11, 2010 at 12:07 PM, TuX RaceR <tu...@gmail.com> wrote:

> Hello List,
>
> I'll be accessing a table mainly via random access and I am looking for an
> efficient way of randomizing the keys.
> I thought about an MD5 hash of the ID of the record, but as MD5 returns a
> string of chars [0-9A-F] I was wondering if there was a better method to
> use.
>
> Thanks
> TuX
>