Posted to user@hbase.apache.org by "Murali Krishna. P" <mu...@yahoo.com> on 2010/09/02 19:43:46 UTC

HBase secondary index performance

Hi,
    I have an IndexedTable with indexes on around 20 columns. The write
performance on the original table is around 60 rows per second. This is just a
one-node setup. Even with multiple parallel clients, I am getting just 60
writes/second. Does that mean a total of 60 * 20 = 1200 writes/second because of
the 20 index tables? This is not good enough for our application. Does the
number 1200 look right? I was expecting around 15k.
    I am using HBase 0.20.6 on Hadoop 0.20.2. Hardware config: 8 GB RAM, 2 cores,
7.2k rpm disk. Will adding nodes increase the writes linearly?

 Thanks,
Murali Krishna

Re: HBase secondary index performance

Posted by Samuru Jackson <sa...@googlemail.com>.
Hi,

I'm not sure I understand your problem completely, but regarding your
update issue:

HBase keeps versions of your columns. If you have an index on something that
needs to be updated, you just overwrite the value in the index. There is no
need to remove things.

I also organize my indexes in separate tables. There is one table for each
indexed column of a table, and I also keep separate tables for composite
indexes.

For fast retrieval I created an indexmanager table, which I use to
retrieve the corresponding indexes for attributes and also to keep statistics
about them, for instance for query planning.


Cheers!

/SJ
-----------
http://uncinuscloud.blogspot.com/



Re: HBase secondary index performance

Posted by "Murali Krishna. P" <mu...@yahoo.com>.
Thanks Samuru,
    I was reading about custom indexing in HBase and wanted to know how
updates are handled with custom indexing. If the original data
doesn't change, it might be a good solution. But say one of the column values
changes in the original table: we need to query the index table for the
original column value, delete it, and then add an entry for the new value. I think
this will run into consistency issues since we are doing it in a
non-transactional manner.

    Are we always doing full indexing and not worrying about increments? Maybe I
am missing something here since I am new to this.

My requirements are such that we handle around 10 million records per day,
most of which are just updates, and we want it to be real time (or NRT). Any
suggestions are appreciated.

Thanks,
Murali Krishna





Re: HBase secondary index performance

Posted by Samuru Jackson <sa...@googlemail.com>.
Hi,

I wrote my own indexer and I actually get pretty good performance.
However, there are still known places where I could gain even more
performance (I just don't have the time right now).

What is important is to write in bulk when you are indexing something. I
posted this before, but maybe you missed it:

I create a list of Puts out of those records:

List<Put> pList = new ArrayList<Put>();

where each Put has WriteToWAL set to false:

p.setWriteToWAL(false);
pList.add(p);

Then I set autoflush to false and create a larger write buffer:

hTable.setAutoFlush(false);
hTable.setWriteBufferSize(1024*1024*12);
hTable.put(pList);
hTable.flushCommits(); // send whatever is still in the write buffer
hTable.setAutoFlush(true);

The following settings have boosted my load performance 5 times. Without
any bigger performance tuning and with no special HW configuration I
achieve 8000-9000 records per second:
p.setWriteToWAL(false);
hTable.setAutoFlush(false);
hTable.setWriteBufferSize(1024*1024*12);
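
To see why turning autoflush off helps, here is a minimal, HBase-free sketch of a client-side write buffer. BufferedWriterSketch, its flush counter, and the 300-byte record size are hypothetical stand-ins for illustration, not HBase classes, and the real HTable buffer may differ in detail. With autoflush on, every put costs one round trip; with it off, puts accumulate until the buffer limit is exceeded and then go out as one batch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a client that buffers writes; not an HBase class.
final class BufferedWriterSketch {
    private final List<byte[]> buffer = new ArrayList<>();
    private final long bufferLimitBytes;
    private final boolean autoFlush;
    private long bufferedBytes = 0;
    private int flushCount = 0; // number of simulated round trips to the server

    BufferedWriterSketch(boolean autoFlush, long bufferLimitBytes) {
        this.autoFlush = autoFlush;
        this.bufferLimitBytes = bufferLimitBytes;
    }

    // Buffers one record and flushes if autoflush is on or the limit is exceeded.
    void put(byte[] record) {
        buffer.add(record);
        bufferedBytes += record.length;
        if (autoFlush || bufferedBytes > bufferLimitBytes) {
            flush();
        }
    }

    // Sends everything buffered so far as one batch (one simulated round trip).
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushCount++;
        buffer.clear();
        bufferedBytes = 0;
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        byte[] record = new byte[300]; // illustrative record size
        BufferedWriterSketch eager = new BufferedWriterSketch(true, 12L * 1024 * 1024);
        BufferedWriterSketch batched = new BufferedWriterSketch(false, 12L * 1024 * 1024);
        for (int i = 0; i < 10_000; i++) {
            eager.put(record);
            batched.put(record);
        }
        batched.flush(); // push the tail that never reached the limit
        System.out.println("round trips with autoflush: " + eager.flushCount());
        System.out.println("round trips with 12 MB buffer: " + batched.flushCount());
    }
}
```

The explicit flush at the end matters in this sketch: records that never push the buffer over its limit stay on the client until something flushes them.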


/SJ
http://uncinuscloud.blogspot.com/







On Fri, Sep 3, 2010 at 8:30 AM, Murali Krishna. P <mu...@yahoo.com> wrote:

> Thanks Andrey,
>
>        * Setting autoflush to false and increasing the write buffer size
> to 12MB improved the writes to 100/s.
>        * Custom indexing is good, but our data keeps changing every day.
> So IndexedTable is probably the best option for us.
>        * I just added one more region server and it did not help. Actually it
> went back to 60/s for some strange reason (with one client). The requests in
> the HBase UI are not uniform across the 2 region servers. One server is doing
> around 2000 and the other 500. Probably once the region gets split and we
> have lots of data, writes will improve? (Now it is just writing to one region
> for the main table.)
>        * Is there some way to bulk load the IndexedTable? Earlier I have
> used the bulk loader tool (a MapReduce job which creates the regions offline)
> but I am not sure whether it works with an indexed table.
>
>
>  Thanks,
> Murali Krishna
>
>
>
>
> ________________________________
> From: Andrey Stepachev <oc...@gmail.com>
> To: user@hbase.apache.org
> Sent: Fri, 3 September, 2010 12:14:29 AM
> Subject: Re: HBase secondary index performance
>
> First, check that your connection is not in autoflush mode.
> Second, you can think about custom indexing instead
> of using IndexedTable. In my experience custom indexing
> (especially if the data isn't modified) is much more performant.
> The problem with IndexedTable is that on every insert
> HBase performs one (random) get operation (to check whether we
> have previously indexed data, and to delete it if it exists). Random gets
> run at around 100-400 requests/s per node, so the 60 you get looks good
> (for IndexedTable).
>
> You can read how to build custom indexes at
>
> http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
>
>

Re: HBase secondary index performance

Posted by "Murali Krishna. P" <mu...@yahoo.com>.
> Please clarify how this index table serves 20 columns - in the above schema,
> columnValue would be different for the 20 columns indexed, I assume.

My query to the index table will be on columnValue + columnName. This is for an
exact match; if you need to scan on a partial value, the key has to be generated
in the reverse order: columnName + columnValue + rowKey. I went for this schema
to reduce the number of tables involved.
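
A small sketch of the two key layouts, using a sorted in-memory set as a stand-in for the index table (HBase keeps rows sorted by key, which is what makes prefix scans possible). The IndexKeyLayouts class, the "|" separator, and the sample values are made up for illustration; a real schema would work on bytes and needs a separator or fixed widths that cannot collide with the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a TreeSet of strings stands in for a key-sorted index table.
final class IndexKeyLayouts {
    static final String SEP = "|"; // hypothetical separator between key parts

    // Layout from the message: exact match on (value, column).
    static String valueFirst(String value, String column, String rowKey) {
        return value + SEP + column + SEP + rowKey;
    }

    // Reversed layout: allows scanning one column by partial (prefix) value.
    static String columnFirst(String column, String value, String rowKey) {
        return column + SEP + value + SEP + rowKey;
    }

    // Returns all keys starting with the given prefix, like a prefix scan.
    static List<String> prefixScan(TreeSet<String> index, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String key : index.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // sorted order: once the prefix stops matching, we are done
            }
            hits.add(key);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>();
        index.add(columnFirst("city", "berlin", "row1"));
        index.add(columnFirst("city", "bern", "row2"));
        index.add(columnFirst("name", "bert", "row3"));
        // Partial-value scan: every row whose city starts with "ber".
        System.out.println(prefixScan(index, "city" + SEP + "ber"));
    }
}
```

With the value-first layout the same prefixScan still answers exact matches (prefix = value + SEP + column + SEP), but a partial value would also pull in entries from other columns.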

Thanks,
Murali Krishna





Re: HBase secondary index performance

Posted by Ted Yu <yu...@gmail.com>.
> My key to the index table is columnValue+columnName+rowKey.
You need to consider the distribution of the above key so that writes to the
index table don't become a bottleneck in the write path.

Please clarify how this index table serves 20 columns - in the above schema,
columnValue would be different for the 20 columns indexed, I assume.


Re: HBase secondary index performance

Posted by Andrey Stepachev <oc...@gmail.com>.
2010/9/6 Murali Krishna. P <mu...@yahoo.com>:
> Hi,
>   My row size is around 300 bytes with total 20 columns. I tried the custom
> indexing without the write to WAL. Currently having only 2 tables, one for the
> main table and another for all 20 indexes. My key to the index table is
> columnValue+columnName+rowKey.

As mentioned before, you can randomize your index insertions.
If you don't need ordered scans or range scans on columnValue, you can
prefix it with some hash, e.g. sha(columnValue) + columnValue +
columnName + rowKey.
This removes the hotspot on one of your region servers.
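
A minimal sketch of such a hash-prefixed key, using only the JDK. The SaltedIndexKey class, the choice to keep just the first 4 bytes of the SHA-1 digest, the UTF-8 encoding, and the lack of separators between the parts are assumptions made for illustration, not something prescribed above. Exact-match lookups still work because the prefix can be recomputed from the value; range scans over columnValue do not.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustration only: builds sha(columnValue) + columnValue + columnName + rowKey.
final class SaltedIndexKey {
    // Assumption: a few leading hash bytes are enough to spread keys across regions.
    static final int PREFIX_BYTES = 4;

    static byte[] build(String columnValue, String columnName, String rowKey) {
        byte[] value = columnValue.getBytes(StandardCharsets.UTF_8);
        byte[] name = columnName.getBytes(StandardCharsets.UTF_8);
        byte[] row = rowKey.getBytes(StandardCharsets.UTF_8);
        byte[] hash = sha1(value);

        byte[] key = new byte[PREFIX_BYTES + value.length + name.length + row.length];
        int pos = 0;
        System.arraycopy(hash, 0, key, pos, PREFIX_BYTES);
        pos += PREFIX_BYTES;
        System.arraycopy(value, 0, key, pos, value.length);
        pos += value.length;
        System.arraycopy(name, 0, key, pos, name.length);
        pos += name.length;
        System.arraycopy(row, 0, key, pos, row.length);
        return key;
    }

    private static byte[] sha1(byte[] input) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(input);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

All entries for the same columnValue share the same prefix, so they stay adjacent in the index, while different values scatter across the key space.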

> I am getting around 500 inserts/second now. (ie, total of ~10K puts). This is
> probably comparable with your numbers based on the data size.
Do all region servers get equal load, or are some servers busier
than others?

>  I have some doubts with the hbase write implementation.
> * Is this the max that we can achieve with any number of region servers? Why
> adding region servers not improving the write performance? Is it because when
> the data doesn't exist in the table, it always writes to one region ?
In general, yes. Before the table splits, all writes go to
one region server.

> * Probably writing to an existing, well distributed table might give better
> performance since the writes will be across machines ? In that case, if we have
> multiple tables (one per index), will it be better during the initial write
> itself (since each table has different region) ??
The more servers take part in the writes, the better.

 Andrey.

Re: HBase secondary index performance

Posted by "Murali Krishna. P" <mu...@yahoo.com>.
Hi,
   My row size is around 300 bytes with a total of 20 columns. I tried custom
indexing without writing to the WAL. Currently I have only 2 tables, one for the
main table and another for all 20 indexes. My key to the index table is
columnValue+columnName+rowKey.
I am getting around 500 inserts/second now (i.e., a total of ~10K puts). This is
probably comparable with your numbers based on the data size.
  I have some doubts about the HBase write implementation.
* Is this the max that we can achieve with any number of region servers? Why
is adding region servers not improving the write performance? Is it because,
when the data doesn't exist in the table yet, it always writes to one region?

* Probably writing to an existing, well-distributed table might give better
performance since the writes will be spread across machines? In that case, if we
have multiple tables (one per index), will it be better during the initial write
itself (since each table has a different region)?
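
For reference, the ~10K figure above can be reproduced with a short calculation. This is only a sketch under the assumption that every row write produces one put to the main table plus one index put per indexed column; the class and method names are made up.

```java
// Illustration only: physical puts generated by logical row writes.
final class WriteAmplification {
    // Each row write costs one put to the main table plus one per indexed column.
    static long putsPerSecond(long rowsPerSecond, int indexedColumns) {
        return rowsPerSecond * (1L + indexedColumns);
    }

    public static void main(String[] args) {
        System.out.println(putsPerSecond(500, 20)); // 10500, the "~10K puts" case
        System.out.println(putsPerSecond(60, 20));  // 1260, the IndexedTable case
    }
}
```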

 Thanks,
Murali Krishna




________________________________
From: Andrey Stepachev <oc...@gmail.com>
To: user@hbase.apache.org
Sent: Sun, 5 September, 2010 11:54:45 PM
Subject: Re: HBase secondary index performance

2010/9/5 Murali Krishna. P <mu...@yahoo.com>:
> Hi,
>        Thanks for the detailed explanation, I liked the idea of timestamp
> check, this will be good enough for us and I can put a periodic MR cleaner.
> However I need some help in understanding the 30K number that was claimed.

The real insert rate will depend on the size of the row, the size of the write
buffer, etc. In the case of a simple row with one long per row I got 30k
requests/second (as shown in HBase).
With JSON-serialised objects of 100-700 bytes each, with validation, I can insert
2-6k objects per second.

> With
> the IndexedTable approach, I got only 1200 rows/s (60 rows/s x 20 index columns).
> I understood that there are additional reads that IndexedTable does, but the 25x
> improvement that you got is very impressive. Can you please help me to
> understand this gain? (My hardware is 8GB/7.2k rpm/2core-2GHz)

Did you try to insert data into a non-indexed region (with the
IndexedTable extension disabled)?
What numbers did you get?

>
>  Thanks,
> Murali Krishna
>
>
>
>
> ________________________________
> From: Andrey Stepachev <oc...@gmail.com>
> To: user@hbase.apache.org
> Sent: Sun, 5 September, 2010 3:53:26 AM
> Subject: Re: HBase secondary index performance
>
> 2010/9/3 Murali Krishna. P <mu...@yahoo.com>:
>
>>        * custom indexing is good, but our data keeps changing every day. So,
>> probably indextable is the best option for us
>
> With custom indexing you can use timestamps to check that an index
> record is still valid
> (or even simply recheck the existence of the value).
> You also need regular index cleanup (an MR job or some custom application).
>
> To index some row identified by 'key' having 'value', we can create an
> index table where the key is [value:key], and insert a row every time we
> insert our values. We will get 30k rows/s/node.
> When we want to find all rows for 'value', we scan [value:0000, value:9999]
> and find all keys that point to rows containing the value.
> We scan the index, do random gets of the rows, recheck that the index is
> still valid (check the value or the timestamp; the index timestamp should be
> >= the value timestamp), and return only valid values (maybe we can even
> delete on the fly when we get a negative result, to automatically clean up
> stale data).
>
>
>>        * Just added one more regionserver and it did not help. Actually it
>> went back to 60/s for some strange reason (with one client). The requests in
>> the hbase ui are not uniform across 2 region servers. One server is doing
>> around 2000 and the other 500. Probably once the region gets split and when
>> we have lots of data, writes will improve? (Now it is just writing to one
>> region for the main table)
>
> Looks like all data goes to one region server. Try to make the writes more
> random (maybe you should make the key a random UUID or use another key
> randomization technique).
>
>>        * Is there some way to bulk load the indexedtable? Earlier I have
>> used the bulk loader tool (mapreduce job which creates the regions offline)
>> but not sure whether it works with indexed table.
>
> Not sure, but you can look at the source code and try to emulate the indexing
> operations in your code after regular bulk loading.
>
>>
>>
>>  Thanks,
>> Murali Krishna
>>
>>
>
> Andrey.
>

Re: HBase secondary index performance

Posted by Andrey Stepachev <oc...@gmail.com>.
2010/9/5 Murali Krishna. P <mu...@yahoo.com>:
> Hi,
>        Thanks for the detailed explanation, I liked the idea of timestamp
> check, this will be good enough for us and I can put a periodic MR cleaner.
> However I need some help in understanding the 30K number that was claimed.

The real insert rate depends on the size of the row, the size of the
write buffer, etc.
With simple rows containing one long per row, I got 30k requests/second
(as shown in HBase).
With JSON-serialised objects of 100-700 bytes each, plus validation, I can
insert 2-6k objects per second.

> With the IndexedTable approach, I got only 1200rows/s (60rows/s X 20 index columns).
> I understood that there arean additional reads that indextable does but  25X
> improvement that you got is very impressive. Can you please help me to
> understand this gain ? (My hardware is 8GB/7.2rpm/2core-2GHz)

Did you try inserting data into a non-indexed region (i.e. with the
indexedtable extension disabled)?
What numbers did you get?


Re: HBase secondary index performance

Posted by "Murali Krishna. P" <mu...@yahoo.com>.
Hi,
        Thanks for the detailed explanation. I like the idea of the timestamp
check; it will be good enough for us, and I can add a periodic MR cleaner.
However, I need some help understanding the 30K number that was claimed. With
the IndexedTable approach, I got only 1200 rows/s (60 rows/s x 20 index columns).
I understand that IndexedTable does additional reads, but the 25x
improvement that you got is very impressive. Can you please help me
understand this gain? (My hardware is 8GB/7.2k rpm/2core-2GHz)

 Thanks,
Murali Krishna





Re: HBase secondary index performance

Posted by Andrey Stepachev <oc...@gmail.com>.
2010/9/5 Samuru Jackson <sa...@googlemail.com>:
> Hi,
>
>> where key will be [value:key] and insert rows every time, when we insert
>> our values. We will got 30k rows/s/node.
>
> Could you specify on what kind of hardware you did this?
A 3-node "cluster": 16 GB RAM, Core 2 Duo, SAS RAID 10.

> How did you design your indexer? Is it multithreaded?
It is not an indexer. It is an abstraction around HTable which
does the put plus additional puts (as described before) into the index
tables. I will release this code later (I don't have an actual date
yet), but it is not rocket science.

30k is the peak requests per second, not a constant rate. For effective
rows (JSON objects of 100-500 bytes with 1-2 indexes on them) I got
1-3k objects per second per node.

>
> /SJ
> -----------
> http://uncinuscloud.blogspot.com/
>

Re: HBase secondary index performance

Posted by Samuru Jackson <sa...@googlemail.com>.
Hi,

> where key will be [value:key] and insert rows every time, when we insert
> our values. We will got 30k rows/s/node.

Could you specify on what kind of hardware you did this? How did you
design your indexer? Is it multithreaded?

/SJ
-----------
http://uncinuscloud.blogspot.com/

Re: HBase secondary index performance

Posted by Andrey Stepachev <oc...@gmail.com>.
2010/9/3 Murali Krishna. P <mu...@yahoo.com>:

>        * custom indexing is good, but our data keeps changing every day. So, probably
> indextable is the best option for us

With custom indexing you can use timestamps to check that an index
record is still valid
(or even simply recheck the existence of the value).
You also need regular index cleanup (an MR job or some custom application).

To index a row identified by 'key' and having 'value', we can create an
index table
whose row key is [value:key], and insert an index row every time we insert
our values. We get 30k rows/s/node.
When we want to find all rows with 'value', we scan [value:0000, value:9999]
and find all the keys
that point to rows containing that value.
We scan the index, do random gets of the rows, recheck that each index entry
is still valid (check the value or the timestamp; the index timestamp should
be >= the value timestamp), and
return only the valid values (maybe we can even delete on the fly when we
get a negative
result, to automatically clean up stale data).
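
The scheme above can be sketched roughly as follows. This is a toy
Python model, not HBase code: SortedTable is a hypothetical in-memory
stand-in for a table with sorted row keys, and indexed_put and
find_by_value are illustrative names. It rechecks by value; comparing
timestamps as described above would be an alternative. It also assumes
values never contain the ':' separator.

```python
import bisect


class SortedTable:
    """Toy stand-in for an HBase table: row keys are kept in sorted order."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> (value, timestamp)

    def put(self, row_key, value, ts):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = (value, ts)

    def get(self, row_key):
        return self._rows.get(row_key)

    def delete(self, row_key):
        if row_key in self._rows:
            del self._rows[row_key]
            self._keys.remove(row_key)

    def scan_prefix(self, prefix):
        """Return all row keys starting with prefix, in sorted order."""
        start = bisect.bisect_left(self._keys, prefix)
        found = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            found.append(key)
        return found


def indexed_put(data, index, key, value, ts):
    """Write the row, then blindly write an index row keyed value:key.

    No read is done before the write, so an old index entry for a
    previous value of this row is left behind as stale data.
    """
    data.put(key, value, ts)
    index.put(f"{value}:{key}", "", ts)


def find_by_value(data, index, value):
    """Scan the index, get each row, and keep only still-valid hits."""
    valid = []
    for index_key in index.scan_prefix(f"{value}:"):
        key = index_key[len(value) + 1:]
        row = data.get(key)
        if row is not None and row[0] == value:
            valid.append(key)
        else:
            index.delete(index_key)  # clean up the stale entry on the fly
    return valid
```

For example, after writing row1=red, row2=red and then row1=blue, a
lookup for "red" returns only row2 and removes the stale red:row1 entry.
The write path never reads, which is the point of the approach; the
cost moves to the read path and to periodic cleanup.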


>        * Just added one more regionserver and it did not help. Actually it went back
> to 60/s for some strange reason(with one client). The requests in the hbase ui
> is not uniform across 2 region servers. One server is doing around 2000 and the
> other 500. Probably once the region gets split and when we have lots of data,
> writes will improve ? (Now it is just writing to one region for the main table)

It looks like all the data goes to one region server. Try to make the writes
more random (maybe use a random UUID as the key, or some other key
randomization technique).
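
Two possible ways to randomize keys, sketched in Python. Both function
names and the bucket count of 16 are assumptions made for illustration,
not something from HBase itself.

```python
import hashlib
import uuid


def salted_key(key: str, buckets: int = 16) -> str:
    """Prefix the key with a bucket number derived from a hash of the key.

    The prefix is deterministic, so a reader can recompute it from the
    original key, while sequential keys are spread over `buckets` key ranges.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{key}"


def random_key() -> str:
    """A fully random row key; the original key must be stored elsewhere."""
    return uuid.uuid4().hex
```

The trade-off with a salted prefix is that a range scan over the
original keys has to be issued once per bucket. With fully random keys
a row can only be found again if the generated key is recorded
somewhere, for example in an index table.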

>        * Is there some way to do bulk load the indexedtable? Earlier I have used the
> bulk loader tool (mapreduce job which creates the regions offline) but not sure
> whether it works with indexed table.

Not sure, but you can look at the source code and try to emulate the
indexing operations in
your code after a regular bulk load.

>
>
>  Thanks,
> Murali Krishna
>
>

Andrey.

Re: HBase secondary index performance

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Sep 3, 2010 at 7:57 AM, Michael Segel <mi...@hotmail.com> wrote:

>
>
>
> > Date: Fri, 3 Sep 2010 18:00:42 +0530
> > From: muralikpbhat@yahoo.com
> > Subject: Re: HBase secondary index performance
> > To: user@hbase.apache.org
> >
> > Thanks Andrey,
> >
> >       * Setting the autoflush to false and increasing the writeBuffer size to 12MB
> > improved the writes to 100/s
> >       * custom indexing is good, but our data keeps changing every day. So, probably
> > indextable is the best option for us
> >       * Just added one more regionserver and it did not help. Actually it went back
> > to 60/s for some strange reason(with one client). The requests in the hbase ui
> > is not uniform across 2 region servers. One server is doing around 2000 and the
> > other 500. Probably once the region gets split and when we have lots of data,
> > writes will improve ? (Now it is just writing to one region for the main table)
> >       * Is there some way to do bulk load the indexedtable? Earlier I have used the
> > bulk loader tool (mapreduce job which creates the regions offline) but not sure
> > whether it works with indexed table.
>
> Just a small suggestion...
>
> If you have a table that is populated and you add a new region server, your
> data isn't going to balance itself out.
> If you want to balance your existing data, you'll need to bring down hbase,
> then run hadoop's balancer app. When its completed, you'll see that your
> data is now spread more evenly across the cloud. Please remember that you
> need to have HBase down when you run the balancer app.
>
>
>
The above is all incorrect.

The data *will* balance itself out on HDFS after major compactions have
taken place, and even before that, the regions *will* balance themselves
across region servers.

Running the balancer while HBase is running is also perfectly safe, though
it is not necessary for performance reasons.

-Todd


>




-- 
Todd Lipcon
Software Engineer, Cloudera

RE: HBase secondary index performance

Posted by Michael Segel <mi...@hotmail.com>.


> Date: Fri, 3 Sep 2010 18:00:42 +0530
> From: muralikpbhat@yahoo.com
> Subject: Re: HBase secondary index performance
> To: user@hbase.apache.org
> 
> Thanks Andrey,
> 
> 	* Setting the autoflush to false and increasing the writeBuffer size to 12MB 
> improved the writes to 100/s
> 	* custom indexing is good, but our data keeps changing every day. So, probably 
> indextable is the best option for us
> 	* Just added one more regionserver and it did not help. Actually it went back 
> to 60/s for some strange reason(with one client). The requests in the hbase ui 
> is not uniform across 2 region servers. One server is doing around 2000 and the 
> other 500. Probably once the region gets split and when we have lots of data, 
> writes will improve ? (Now it is just writing to one region for the main table)
> 	* Is there some way to do bulk load the indexedtable? Earlier I have used the 
> bulk loader tool (mapreduce job which creates the regions offline) but not sure 
> whether it works with indexed table. 

Just a small suggestion...

If you have a table that is populated and you add a new region server, your data isn't going to balance itself out.
If you want to balance your existing data, you'll need to bring down HBase, then run Hadoop's balancer app. When it's completed, you'll see that your data is now spread more evenly across the cloud. Please remember that you need to have HBase down when you run the balancer app.


 		 	   		  

Re: HBase secondary index performance

Posted by "Murali Krishna. P" <mu...@yahoo.com>.
Thanks Andrey,

	* Setting autoflush to false and increasing the writeBuffer size to 12MB 
improved the writes to 100/s.
	* Custom indexing is good, but our data keeps changing every day. So 
indextable is probably the best option for us.
	* I just added one more regionserver and it did not help. Actually it went back 
to 60/s for some strange reason (with one client). The requests shown in the HBase UI 
are not uniform across the 2 region servers. One server is doing around 2000 and the 
other 500. Probably once the region gets split and we have lots of data, 
writes will improve? (Right now it is just writing to one region for the main table.)
	* Is there some way to bulk load the indexedtable? Earlier I used the 
bulk loader tool (a mapreduce job which creates the regions offline), but I am not sure 
whether it works with an indexed table. 


 Thanks,
Murali Krishna





Re: HBase secondary index performance

Posted by Andrey Stepachev <oc...@gmail.com>.
First, check that your connection is not in autoflush mode.
Second, you can think about custom indexing instead
of using indexedtable. In my experience custom indexing
(especially if the data isn't modified) is much more performant.
The problem with indexedtable is that on every insert
HBase performs one (random) get operation (to check whether we
have previously indexed data, and delete it if it exists). Random gets
run at around 100-400 requests per second per node, so the 60 you get
looks reasonable (for indexedtable).
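
To illustrate the first point: with autoflush on, every put is its own
round trip to the server; with it off, puts pile up in a client-side
write buffer and are sent together. Below is a toy Python model of that
buffering, not HBase code. BufferedWriter and send_batch are
hypothetical names; in the 0.20 Java client the corresponding calls are
HTable.setAutoFlush(false), HTable.setWriteBufferSize(...) and
HTable.flushCommits().

```python
class BufferedWriter:
    """Collects puts and sends them in batches once enough bytes pile up."""

    def __init__(self, send_batch, buffer_bytes):
        self._send_batch = send_batch      # callable taking a list of puts
        self._buffer_bytes = buffer_bytes  # flush threshold in bytes
        self._pending = []
        self._pending_bytes = 0

    def put(self, row_key: bytes, value: bytes):
        self._pending.append((row_key, value))
        self._pending_bytes += len(row_key) + len(value)
        if self._pending_bytes >= self._buffer_bytes:
            self.flush()

    def flush(self):
        """Send whatever is buffered; call this before closing the writer."""
        if self._pending:
            self._send_batch(self._pending)
            self._pending = []
            self._pending_bytes = 0
```

With 10-byte puts and a 100-byte buffer, 25 puts cost three round trips
(10, 10 and, after the final flush, 5 puts) instead of 25. Buffered puts
that have not been flushed are lost if the client dies, so the final
flush matters.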

You can read about how to build custom indexes here:
http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/

2010/9/2 Murali Krishna. P <mu...@yahoo.com>:
> Hi,
>    I have an indexedtable with index on around 20 columns. The write
> performance on the original table is around 60 per second. This is just a one
> node setup. Even with mutiple parallel clients, I am getting just 60
> writes/second. That means a total write of 60 * 20 = 1200 writes/second due to
> 20 indextables? This is not good enough for our application. Is this number 1200
> look right ? I was expecting around 15k.
>    I am using 0.20.6 HBase on 0.20.2 Hadoop. hardware config (8g ram, 2core,
> 7.2k rpm disk). Will adding nodes increase the writes linearly?
>
>  Thanks,
> Murali Krishna
>