Posted to user@hbase.apache.org by Andre Reiter <a....@web.de> on 2011/10/28 16:36:05 UTC

High Throughput using row keys based on the current time

Hi everybody,

we have the following scenario:
our clustered web application needs to write records to HBase, and we need to support a very high throughput: we expect 10-30 thousand requests per second, possibly even more

usually this is not a problem for HBase if we use a "random" row key; in that case the data is distributed evenly across all region servers
but we need to generate our keys based on the current time, so that we can run MR jobs over a period of time without processing the whole data set, using
   scan.setStartRow(startRow);
   scan.setStopRow(stopRow);
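For illustration, here is a minimal self-contained sketch of what such time-based keys could look like; the 8-byte big-endian epoch-millis scheme is an assumption on my part, not something fixed in our application:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class TimeRangeKeys {
    // Hypothetical key scheme: 8-byte big-endian epoch milliseconds.
    // For non-negative timestamps, big-endian encoding preserves
    // chronological order under HBase's lexicographic byte comparison.
    static byte[] rowKeyFor(long epochMillis) {
        return ByteBuffer.allocate(8).putLong(epochMillis).array();
    }

    public static void main(String[] args) {
        long start = 1319812565000L;          // period start
        long stop  = start + 3_600_000L;      // one hour later
        byte[] startRow = rowKeyFor(start);
        byte[] stopRow  = rowKeyFor(stop);
        // With the HBase client one would then scan the period via:
        //   scan.setStartRow(startRow);  // inclusive
        //   scan.setStopRow(stopRow);    // exclusive
        System.out.println(Arrays.compare(startRow, stopRow) < 0); // prints true
    }
}
```

The downside is exactly the one described below: all keys for the current period share a common prefix and land on one region.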

in our case the generated row keys look similar and therefore go to the same region server... so this approach does not really use the power of the whole cluster, only a single server, which can be dangerous under very high load

so we are thinking about writing the records to an HDFS file first, and additionally running an MR job periodically that reads the finished HDFS files and inserts the records into HBase
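The staging idea above could be sketched roughly as follows; this uses a plain local file as a stand-in for the HDFS staging file, and the drain step stands in for the periodic MR job (the actual HBase Put calls are elided):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class StagingSketch {
    // Stand-in for the HDFS staging file; in the real setup this would
    // be a file on HDFS consumed by a periodic MR job.
    static final Path STAGING = Paths.get("staging.log");

    // Web tier: append one record per line instead of writing
    // directly to HBase, so the write path stays cheap and local.
    static void stage(String record) throws IOException {
        Files.write(STAGING, (record + "\n").getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Periodic job: read the finished file back and batch-insert the
    // records into HBase (Put calls elided), then remove the file.
    static List<String> drain() throws IOException {
        List<String> records = Files.readAllLines(STAGING);
        Files.delete(STAGING);
        return records;
    }

    public static void main(String[] args) throws IOException {
        stage("1319812565000|event-a");
        stage("1319812566000|event-b");
        System.out.println(drain().size()); // prints 2
    }
}
```

The trade-off is that records only become visible in HBase after the next job run, so this buys write throughput at the cost of freshness.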

what do you guys think about it? any suggestions would be much appreciated

regards
andre