Posted to user@hbase.apache.org by Pal Konyves <pa...@gmail.com> on 2013/05/08 04:01:24 UTC

EC2 Elastic MapReduce HBase install recommendations

Hi,

Has anyone got any recommendations for running HBase on EC2? I am
testing it, and so far I am very disappointed with it. I did not change
anything in the default 'Amazon distribution' installation. It has one
master node and two slave nodes, and write performance is around 2500 small
rows per second at most, but I expected it to be much better. This is
with batch put operations with autocommit turned off, where each batch
contains about 500-1000 rows. With autocommit on, it does not even
reach 1000 rows per second.

All nodes were m1.large instances.

Any experiences or suggestions? Is it worth trying the RMap distribution
instead of the Amazon one?

Thanks,
Pal
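
For readers following along, the difference between writing with autocommit on and writing in batches can be pictured with a toy model. This is only a sketch that assumes each flush costs one client-to-server round trip. The class BufferedWriterSketch and its methods are hypothetical names invented for illustration, not part of the HBase client, and the model ignores everything that happens on the server side.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a client-side write buffer; this is NOT the HBase client. */
class BufferedWriterSketch {
    private final int batchSize;                          // rows sent per simulated round trip
    private final List<String> buffer = new ArrayList<>();
    private int roundTrips = 0;                           // simulated round trips so far

    BufferedWriterSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Queue one row and send the batch once the buffer is full. */
    void put(String rowKey) {
        buffer.add(rowKey);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Send whatever is buffered as a single simulated round trip. */
    void flush() {
        if (!buffer.isEmpty()) {
            roundTrips++;
            buffer.clear();
        }
    }

    int roundTrips() {
        return roundTrips;
    }

    /** Count the round trips needed to write rowCount rows with the given batch size. */
    static int roundTripsFor(int rowCount, int batchSize) {
        BufferedWriterSketch writer = new BufferedWriterSketch(batchSize);
        for (int i = 0; i < rowCount; i++) {
            writer.put("row-" + i);
        }
        writer.flush(); // send the final, possibly partial, batch
        return writer.roundTrips();
    }

    public static void main(String[] args) {
        System.out.println("autocommit on : " + roundTripsFor(10_000, 1));
        System.out.println("batches of 500: " + roundTripsFor(10_000, 500));
    }
}
```

A batch size of 1 stands in for autocommit on. Fewer round trips is only one factor in write throughput; the rest of the thread points at others, such as instance I/O and rowkey collisions.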

Re: EC2 Elastic MapReduce HBase install recommendations

Posted by ramkrishna vasudevan <ra...@gmail.com>.
Is your EC2 instance using EBS or instance storage as the data store?
If it is EBS, then the latency is a bit high, as per Andrew's
experience.

Regards
Ram



Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Marcos Luis Ortiz Valmaseda <ma...@gmail.com>.
I think Andrew talked about this some years ago and created some
scripts for it. You can find them here:
https://github.com/apurtell/hbase-ec2

Then, you can review some links on this topic:
http://blog.cloudera.com/blog/2012/10/set-up-a-hadoophbase-cluster-on-ec2-in-about-an-hour/
http://my.safaribooksonline.com/book/databases/storage-systems/9781849517140/1dot-setting-up-hbase-cluster/id286696951
http://whynosql.com/why-we-run-our-hbase-on-ec2/

You can also read Andrew's HBase on EC2 demo from HBaseCon 2012:
https://github.com/apurtell/ec2-demo







-- 
Marcos Ortiz Valmaseda
Product Manager at PDVSA
http://about.me/marcosortiz

Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Asaf Mesika <as...@gmail.com>.
We ran into that as well.
You need to make sure, when sending a List of Puts, that all the rowkeys in
it are unique. Otherwise, as Ted said, the for loop that acquires the row
locks will run multiple times for any rowkey that repeats itself.
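
One way to get unique rowkeys per batch is to merge all cells that share a rowkey into a single entry before the batch is sent. The following is a minimal sketch using only the Java standard library. RowKeyMergeSketch, Cell and mergeByRowKey are hypothetical names, and the nested map merely stands in for "one Put per row". It also assumes that keeping only the last value written to a given column is acceptable, which would not hold if every version must be stored.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: collapse cells that share a rowkey so each rowkey appears once per batch. */
class RowKeyMergeSketch {

    /** One cell to write. Hypothetical helper type, not an HBase class. */
    static final class Cell {
        final String rowKey;
        final String column;
        final String value;

        Cell(String rowKey, String column, String value) {
            this.rowKey = rowKey;
            this.column = column;
            this.value = value;
        }
    }

    /** Group cells by rowkey, keeping first-seen row order; a later value for the same column wins. */
    static Map<String, Map<String, String>> mergeByRowKey(List<Cell> cells) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (Cell c : cells) {
            rows.computeIfAbsent(c.rowKey, k -> new LinkedHashMap<>())
                .put(c.column, c.value);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
            new Cell("user1", "cf:a", "1"),
            new Cell("user2", "cf:a", "2"),
            new Cell("user1", "cf:b", "3"));

        Map<String, Map<String, String>> rows = mergeByRowKey(cells);
        System.out.println(cells.size() + " cells merged into " + rows.size() + " rows");
    }
}
```

With the real client, each entry of the merged map would become one Put carrying all of that row's columns, so the row lock is taken once per row in the batch rather than once per cell.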


Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Ted Yu <yu...@gmail.com>.
High collision rate means high contention at taking the row locks. 
This results in poor write performance. 

Cheers


Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Pal Konyves <pa...@gmail.com>.
Hi,

I decided not to do any tuning, because my whole project is about
experimenting with HBase (it's a school project). However, it turned out that
my sample data generated lots of rowkey collisions: 4 million inserts
resulted in only about 5000 rows, although the column data differed.
When I changed my sample dataset to have no rowkey collisions,
performance increased by about a factor of 10. Why is that?

Thanks,
Pal
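
To put the numbers above in perspective: 4,000,000 inserts landing on about 5,000 rows is roughly 800 writes per row on average. A quick check like the one below can be run on a sample of generated rowkeys before loading them. It is a minimal sketch with hypothetical names (RowKeyCollisionSketch, averageWritesPerRow), and the demo data is a scaled-down stand-in, not the original dataset.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: measure how heavily a set of generated rowkeys collides. */
class RowKeyCollisionSketch {

    /** Average number of writes per distinct rowkey; 1.0 means no collisions. */
    static double averageWritesPerRow(List<String> rowKeys) {
        Set<String> distinct = new HashSet<>(rowKeys);
        if (distinct.isEmpty()) {
            return 0.0;
        }
        return (double) rowKeys.size() / distinct.size();
    }

    public static void main(String[] args) {
        // Scaled-down stand-in for the case above: 4,000 inserts over 5 distinct keys.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 4_000; i++) {
            keys.add("row-" + (i % 5));
        }
        System.out.println("average writes per row: " + averageWritesPerRow(keys));
    }
}
```

A result close to 1.0 means the rowkeys are nearly unique. A large value means many writes target the same rows, which is the situation described in this message.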



Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Michel Segel <mi...@hotmail.com>.
What I am saying is that, by default, you get two mappers per node.
A 4xlarge instance can run HBase with more mapred slots, so you will want to
tune the defaults based on machine size. Not just mapred, but the HBase
settings too. You need to do this when the EMR cluster starts up, though...

Sent from a remote device. Please excuse any typos...

Mike Segel


Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Pal Konyves <pa...@gmail.com>.
I chose Amazon principally because it is supposedly high
performance and, more importantly, HBase is already set up if I choose
it as an EMR workflow. I wanted to save the time of setting up the cluster
manually on EC2 instances.

Are you saying I will get higher performance if I set up HBase on
the cluster manually, instead of using the default Amazon HBase distribution? Or
is it worth tuning the Amazon distribution with a bootstrap action? How
long does it take to set up the cluster with HDFS manually?

I will also try larger instance types.



Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Michel Segel <mi...@hotmail.com>.
With respect to EMR, you can run HBase fairly easily.
You can't run MapR with HBase on EMR; stick with Amazon's release.

And you can run it, but you will want to know your tuning parameters up
front when you instantiate it.



Sent from a remote device. Please excuse any typos...

Mike Segel


Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Amandeep Khurana <am...@gmail.com>.
To add to what Andy said, the keys to getting HBase running well on AWS are:

1. Choose the right instance types. I usually recommend the HPC
instances or, now, the high storage density instances. Those will give
you the best performance.

2. Use the latest Amazon Linux AMIs and the latest HBase and HDFS
versions that work with each other.

3. Tune HBase for your workload. You have to do this anyway, but HBase
on AWS is less forgiving than on premises.

I've personally tested up to 10k req/sec/server writing 1K payloads on
HBase 0.92 (that's old!) on HPC instances.


On May 8, 2013, at 9:05 PM, Andrew Purtell <ap...@apache.org> wrote:

> M7 is not Apache HBase, or any HBase. It is a proprietary NoSQL datastore
> with (I gather) an Apache HBase compatible Java API.
>
> As for running HBase on EC2, we recently discussed some particulars, see
> the latter part of this thread: http://search-hadoop.com/m/rI1HpK90gu where
> I hijack it. I wouldn't recommend launching HBase as part of an EMR flow
> unless you want to use it only for temporary random access storage, and in
> which case use m2.2xlarge/m2.4xlarge instance types. Otherwise, set up a
> dedicated HBase backed storage service on high I/O instance types. The
> fundamental issue is IO performance on the EC2 platform is fair to poor.
>
> I have also noticed a large difference in baseline block device latency if
> using an old Amazon Linux AMI (< 2013) or the latest AMIs from this year.
> Use the new ones, they cut the latency long tail in half. There were some
> significant kernel level improvements I gather.

Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Andrew Purtell <ap...@apache.org>.
M7 is not Apache HBase, or any HBase. It is a proprietary NoSQL datastore
with (I gather) an Apache HBase compatible Java API.

As for running HBase on EC2, we recently discussed some particulars, see
the latter part of this thread: http://search-hadoop.com/m/rI1HpK90gu where
I hijack it. I wouldn't recommend launching HBase as part of an EMR flow
unless you want to use it only for temporary random access storage, in
which case use m2.2xlarge/m2.4xlarge instance types. Otherwise, set up a
dedicated HBase-backed storage service on high I/O instance types. The
fundamental issue is that I/O performance on the EC2 platform is fair to
poor.

I have also noticed a large difference in baseline block device latency
between an old Amazon Linux AMI (< 2013) and the latest AMIs from this year.
Use the new ones; they cut the latency long tail in half. There were some
significant kernel-level improvements, I gather.


On Wed, May 8, 2013 at 10:42 AM, Marcos Luis Ortiz Valmaseda <
marcosluis2186@gmail.com> wrote:

> I think that when you talk about RMap, you are referring to
> MapR's distribution.
> I think that MapR's team released a very good version of its Hadoop
> distribution focused on HBase called M7. You can see its overview here:
> http://www.mapr.com/products/mapr-editions/m7-edition
>
> But this release was under beta testing, and I see that it's not included
> in the Amazon Marketplace yet:
>
> https://aws.amazon.com/marketplace/seller-profile?id=802b0a25-877e-4b57-9007-a3fd284815a5



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: EC2 Elastic MapReduce HBase install recommendations

Posted by Marcos Luis Ortiz Valmaseda <ma...@gmail.com>.
I think that when you talk about RMap, you are referring to
MapR's distribution.
I think that MapR's team released a very good version of its Hadoop
distribution focused on HBase called M7. You can see its overview here:
http://www.mapr.com/products/mapr-editions/m7-edition

But this release was under beta testing, and I see that it's not included
in the Amazon Marketplace yet:
https://aws.amazon.com/marketplace/seller-profile?id=802b0a25-877e-4b57-9007-a3fd284815a5







-- 
Marcos Ortiz Valmaseda
Product Manager at PDVSA
http://about.me/marcosortiz