Posted to mapreduce-user@hadoop.apache.org by Andreas Reiter <a....@web.de> on 2011/10/28 16:08:17 UTC

High Throughput using row keys based on the current time

Hi everybody,

we have the following scenario:
our clustered web application needs to write records to HBase, and we need to support a very high throughput: we expect up to 10,000-30,000 requests per second, maybe even more

Usually this is not a problem for HBase: if we use a "random" row key, the data is distributed evenly across all region servers.
But we need to generate our keys based on the current time, so that we can run MR jobs over a given period of time without processing the whole data set, using
   scan.setStartRow(startRow);
   scan.setStopRow(stopRow);

In our case the generated row keys all look very similar and therefore go to the same region server... so this approach does not really use the power of the whole cluster, only a single server, which can be dangerous under very high load.
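
For reference, this is roughly what our key scheme and the period scan look like (a simplified sketch only; the table name "events", the column family "d" and the record id are made up):

// Sketch of the time-based row key and a period scan (HBase 0.90-era client API).
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeKeyExample {

  // Row key = epoch millis + record id, so keys sort by time.
  static byte[] rowKey(long timestampMillis, String recordId) {
    return Bytes.add(Bytes.toBytes(timestampMillis), Bytes.toBytes(recordId));
  }

  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "events");

    // Write: all keys from the same moment share the same prefix,
    // which is exactly what sends them to a single region.
    Put put = new Put(rowKey(System.currentTimeMillis(), "req-123"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
    table.put(put);

    // Read a period: scan [startTime, stopTime).
    long stopTime = System.currentTimeMillis();
    long startTime = stopTime - 60 * 60 * 1000L; // last hour
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(startTime));
    scan.setStopRow(Bytes.toBytes(stopTime));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // process r ...
    }
    scanner.close();
    table.close();
  }
}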

So we are thinking about writing the records to HDFS files first, and additionally running an MR job periodically that reads the finished HDFS files and inserts the records into HBase.
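
The staging step itself would be simple, e.g. appending to a SequenceFile on HDFS and rolling to a new file periodically so the MR job only picks up closed files (again just a sketch; the path is made up):

// Sketch of the staging step: append records to a SequenceFile on HDFS.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class StagingWriter {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/staging/events-" + System.currentTimeMillis() + ".seq");

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, LongWritable.class, Text.class);
    try {
      // key = event time, value = the raw record
      writer.append(new LongWritable(System.currentTimeMillis()), new Text("..."));
    } finally {
      writer.close();
    }
  }
}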

What do you guys think about it? Any suggestions would be much appreciated.

regards
andre

Re: High Throughput using row keys based on the current time

Posted by Joey Echeverria <jo...@cloudera.com>.
Have you looked into bulk imports? You can write your data into HDFS
and then run a MapReduce job to generate the files that HBase uses to
serve data. After the job finishes, there's a utility to copy the
files into HBase's directory and your data is visible. Check out
http://hbase.apache.org/bulk-loads.html for details.
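
Roughly, the driver for such a job could look like this (just a sketch, untested; it assumes the staged input is SequenceFiles of (LongWritable timestamp, Text record), and the table name "events" and column family "d" are placeholders):

// Sketch of a bulk-load driver: the MR job writes HFiles instead of issuing Puts,
// then LoadIncrementalHFiles moves the finished HFiles into the live table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

  // Placeholder mapper: turns each staged (timestamp, record) pair into a Put.
  public static class EventMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable ts, Text record, Context context)
        throws java.io.IOException, InterruptedException {
      byte[] row = Bytes.toBytes(ts.get()); // time-based row key, as in the original post
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(record.toString()));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events"); // table name is made up

    Job job = new Job(conf, "events-bulk-load");
    job.setJarByClass(BulkLoadDriver.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(EventMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // staged HDFS files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output dir

    // Wires in the reducer, partitioner and output format needed to write
    // one set of HFiles per region of the target table.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Move the generated HFiles into the live table.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
    }
  }
}

You would point args[0] at the finished staging files and args[1] at an empty HDFS directory for the generated HFiles.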

-Joey




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: Mahout and Hadoop

Posted by be...@gmail.com.
Hey Bish
        AFAIK the Cloudera repository has Mahout now, so maybe it is included in the latest CDH3u2 demo VM from Cloudera as well; I'm not sure, since I haven't checked it yet. Please check the Cloudera downloads for more details.
      AFAIK Mahout is just a collection of classes, so does it really require an installation? If you take a Mahout release such as 0.5, you may not even need to do a build; if you take the latest trunk, then you have to build it yourself. I played around with Mahout CF some time back and I don't remember installing anything.
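
For example, with just mahout-core (and its dependencies) on the classpath, a small Taste CF recommender is plain Java with no install step. A rough sketch from memory (the prefs.csv file and the numbers are made up):

// Sketch of Mahout's Taste CF API used as a plain library.
// Assumes prefs.csv holds "userID,itemID,rating" lines.
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class QuickRecommender {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 1
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}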

Regards
Bejoy K S


Mahout and Hadoop

Posted by Bish Maten <bi...@yahoo.com>.

I have Hadoop installed on an Ubuntu VM and next need Mahout installed on the same virtual machine.
I was hoping there is already a preconfigured Hadoop + Mahout setup available? The Hadoop install was simple, but the Mahout install appears to have some specific installation directories, dependencies, etc. Any pointers to get Mahout up and ready would be appreciated, or a link to an already preconfigured virtual machine if one exists.

Re: WordCount example for double

Posted by Brock Noland <br...@cloudera.com>.
Everywhere IntWritable occurs, change it to DoubleWritable.
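
A rough sketch of what that looks like with the org.apache.hadoop.mapreduce API (not the exact tutorial code, just the map and reduce sides with DoubleWritable):

// Word count emitting doubles instead of ints.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DoubleWordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private static final DoubleWritable ONE = new DoubleWritable(1.0);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private final DoubleWritable result = new DoubleWritable();

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}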


WordCount example for double

Posted by Bish Maten <bi...@yahoo.com>.
http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html


This example gives its output as integers.
Is there a version for double?

Thanks