Posted to user@hbase.apache.org by Cristofer Weber <cr...@neogrid.com> on 2012/08/30 15:27:05 UTC

[maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Just read this article, "Solving Big Data Challenges for Enterprise Application Performance Management," published this month in Volume 5, No. 12 of the Proceedings of the VLDB Endowment, where they measured 6 different databases - Project Voldemort, Redis, HBase, Cassandra, MySQL Cluster and VoltDB - with YCSB on two different kinds of clusters, memory-bound and disk-bound, and I have doubts about the results for HBase since:

* HBase version was 0.90.4

* Master nodes were deployed together with data nodes

* They didn't report tuning parameters

There's also a paragraph where they reported that HBase failed frequently in non-deterministic ways while running YCSB.

My intention with this e-mail is to look for opinions from you, who are more experienced with HBase, on where this experiment's setup could be changed to improve read operations, since in this setup HBase did not perform as well as Cassandra and Project Voldemort.

Here's the article: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf and Volume 5 home: http://vldb.org/pvldb/vol5.html

Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Jean-Daniel Cryans <jd...@apache.org>.

On Thu, Aug 30, 2012 at 6:49 AM, Dave Wang <ds...@cloudera.com> wrote:
> Why were they using 0.90.4 in 2012?  Would have been nice to see some of
> the more recent work done in the area of performance.

VLDB papers are submitted way in advance. For example, this year's
window was April 2011 - March 2012: you submit at the beginning of a
month, and for all months before Jan 2012 you could get a revision
request, which gives you a better chance at making it to the
conference.

That, and it takes time writing a paper :)

J-D

Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Andrew Purtell <ap...@apache.org>.

Asynchbase redone with PB and attention to security would be a good place
to start. I can't commit resources in the immediate term, so that's easy
for me to say, I know. Anyway, it seems we're on the same page wrt the client.

On Friday, August 31, 2012, lars hofhansl wrote:

> Many of us have been saying for a while that the client needs love (i.e.
> needs to be rewritten) and that a new client should follow an async API
> (maybe with a thin synchronous veneer on top of it).
>
> The client is a big piece of HBase. And implementing all the aspects
> including security is a major task and nobody has committed the necessary
> resources for it, yet.
> asynchbase is a start, but it does not support many of the HBase features
> (coprocessors, security, etc).
>
> -- Lars

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by lars hofhansl <lh...@yahoo.com>.

Many of us have been saying for a while that the client needs love (i.e. needs to be rewritten) and that a new client should follow an async API (maybe with a thin synchronous veneer on top of it).

The client is a big piece of HBase. And implementing all the aspects including security is a major task and nobody has committed the necessary resources for it, yet.
asynchbase is a start, but it does not support many of the HBase features (coprocessors, security, etc).
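
A minimal sketch of the "async API with a thin synchronous veneer" idea, written in Python rather than HBase's Java and unrelated to the real HBase or asynchbase APIs. AsyncKVClient, SyncKVClient and the in-memory dict are hypothetical stand-ins for illustration only:

```python
import asyncio
import threading


class AsyncKVClient:
    """Hypothetical async core: every operation is a coroutine."""

    def __init__(self):
        self._rows = {}  # in-memory stand-in for a remote table

    async def put(self, row, value):
        await asyncio.sleep(0)  # stand-in for a non-blocking RPC
        self._rows[row] = value

    async def get(self, row):
        await asyncio.sleep(0)  # stand-in for a non-blocking RPC
        return self._rows.get(row)


class SyncKVClient:
    """Thin synchronous veneer: each call blocks on the async core."""

    def __init__(self, core):
        self._core = core
        # A private event loop on a background thread drives the async core.
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._loop.run_forever, daemon=True
        )
        self._thread.start()

    def _call(self, coro):
        # Submit the coroutine to the background loop and wait for its result.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def put(self, row, value):
        self._call(self._core.put(row, value))

    def get(self, row):
        return self._call(self._core.get(row))

    def close(self):
        # Stop the loop from its own thread, then clean up.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


if __name__ == "__main__":
    client = SyncKVClient(AsyncKVClient())
    client.put("row1", "value1")
    print(client.get("row1"))
    client.close()
```

The point of the layering is that all the real logic lives in the async core once; the synchronous class only forwards calls and waits, so it stays small.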


-- Lars



________________________________
 From: Andrew Purtell <ap...@apache.org>
To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com> 
Sent: Thursday, August 30, 2012 2:41 PM
Subject: Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management
 

I do want to take a closer look at it. Not with the intent to replace the PB RPC with it, but it's odd to have two RPC stacks. What refactoring and code simplification/removal opportunities are here? Don't know (yet). More generally, to experiment with simple native async clients.


Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Andrew Purtell <ap...@apache.org>.

I do want to take a closer look at it. Not with the intent to replace the
PB RPC with it, but it's odd to have two RPC stacks. What refactoring and
code simplification/removal opportunities are here? Don't know (yet). More
generally, to experiment with simple native async clients.

On Thursday, August 30, 2012, lars hofhansl wrote:

> 0.94+ has the option to run a thrift-server-thread inside the
> RegionServers. Maybe we should improve upon that?

Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by lars hofhansl <lh...@yahoo.com>.

0.94+ has the option to run a thrift-server-thread inside the RegionServers. Maybe we should improve upon that?



________________________________
 From: Andrew Purtell <ap...@apache.org>
To: Andrew Purtell <ap...@apache.org> 
Cc: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Thursday, August 30, 2012 9:41 AM
Subject: Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management
 
Just want to clarify I mean experimenting with the approach of the Thrift
client work not use of Thrift particularly.


Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Andrew Purtell <ap...@apache.org>.

Just want to clarify: I mean experimenting with the approach of the Thrift
client work, not the use of Thrift in particular.


Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Andrew Purtell <ap...@apache.org>.

This paper could very well have benchmarked the relative performance of the
YCSB drivers. Some takeaways for me here are:

    - Cluster setup is still too difficult.

    - There are opportunities for autotuning that would make it easier for
      users to get it right the first time, and for academics and casual
      benchmarkers alike to get a good result without becoming experts in
      HBase configuration.

    - The client library has been evolving toward fully async dispatch. We
      should focus on this, and perhaps even consider reimplementing the sync
      client on a refactored async core. We should also look at making the
      Thrift-based stuff FB put in front and center, because then native
      clients are possible.

    - Given the above client work, the YCSB HBase driver should get a
      rewrite.

On Thu, Aug 30, 2012 at 4:49 PM, Dave Wang <ds...@cloudera.com> wrote:

> My reading of the paper is that they are actually not clear about whether
> or not HMasters were deployed on datanodes.
>
> I'm going to guess that they just used default configurations for HBase and
> YCSB, but the paper again is not specific enough.
>
> Why were they using 0.90.4 in 2012?  Would have been nice to see some of
> the more recent work done in the area of performance.
>
> One thing the paper does touch on is the relative difficulty of standing up
> the cluster, which has not changed since 0.90.4.  I think that's definitely
> something that could be improved upon.
>
> - Dave

Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Stack <st...@duboce.net>.

On Thu, Aug 30, 2012 at 4:28 PM, Cristofer Weber
<cr...@neogrid.com> wrote:
> On the other hand, I think that I can help in one way or another: documenting undocumented features, collecting more data on the effects of changing default values and relating these changes to different HBase use cases, etc. It's hard to start contributing to Open Source projects as sophisticated as HBase, but it can be a bit easier to contribute by documenting features and running experiments, and I think that there are others wondering if they can contribute to HBase as well, but - speaking for myself - a lot of guidance is needed. Hope to get this guidance here ;-)
>

You can help loads, Cristofer, because you are in that temporary
condition where you are a noob on hbase turf (you are like the fellas
who wrote the paper at the start of their week of tuning).  And you
seem to be a <flattery>generally clueful fellow</flattery>.  Please do
make noise about the silly things you think a noob has to put up with
getting their HBase system up and running.  The hoary old veterans
that hang out around these parts are no longer able to 'see' the HBase
issues a noob has, having gotten across the hurdle a long time ago.
Let's at least file issues for what you run across on-boarding.  Thank you,
St.Ack

RE: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Cristofer Weber <cr...@neogrid.com>.

Being one of the guys who are "selling" the HBase idea at work (I presented a PoC this week, by the way!), I know that at some point I will have to explain the conclusions from articles like this one, and this kind of conclusion will probably be really hard to explain. I will try to reach the authors to check what kind of failures they faced and which performance improvements they made in their clusters, but sadly this will not change the publication.

On the other hand, I think that I can help in one way or another: documenting undocumented features, collecting more data on the effects of changing default values and relating these changes to different HBase use cases, etc. It's hard to start contributing to Open Source projects as sophisticated as HBase, but it can be a bit easier to contribute by documenting features and running experiments, and I think that there are others wondering if they can contribute to HBase as well, but - speaking for myself - a lot of guidance is needed. Hope to get this guidance here ;-)

Best regards,
Cristofer
________________________________________
From: saint.ack@gmail.com [saint.ack@gmail.com] on behalf of Stack [stack@duboce.net]
Sent: Thursday, August 30, 2012 19:04
To: user@hbase.apache.org
Subject: Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

On Thu, Aug 30, 2012 at 7:51 AM, Cristofer Weber
<cr...@neogrid.com> wrote:
> About HMasters, yes, it's not clear.
>
> In section 6.1 they say that "Since we focused on a setup with a maximum of 12 nodes, we did not assign the master node and jobtracker to separate nodes instead we deployed them with data nodes."
>
> But in section 4.1 they say that "The configuration was done using a dedicated node for the running master processes (NameNode and SecondaryNameNode), therefore for all the benchmarks the specified number of servers correspond to nodes running slave processes (DataNodes and TaskTrackers) as well as HBase’s region server processes."
>
> About configurations, the first paragraph on "6. EXPERIENCES" contains this: "In our initial test runs, we ran every system with the default configuration, and then tried to improve the performance by changing various tuning parameters. We dedicated at least a week for configuring and tuning each system (concentrating on one system at a time) to get a fair comparison."
>
> I agree that it would be nice to see this experiment with 0.94.1, but 0.90.4 was released a year ago, so I understand that this version was the official version when these experiments were conducted.
>

Its a bit tough going back in time fixing 0.90.4 results.  The
"...failed frequently in non-deterministic ways..." is an ugly mark to
have hanging over hbase in a paper like this that will probably be
around a while.  I wonder what the cause was (I don't think that
typical of 0.90.4 IIRC).

On how to improve read performance, if its not in here,
http://hbase.apache.org/book.html#performance, in the refguide, then
the tuning option might as well not exist (Anyone see anything
missing).

We consistently do bad in these tests though our operational, actual
experience seems much better than what is shown in these benchmarks.
As has been said elsewhere on this thread, the takeaway is improved
defaults and auto-tuning but the only time we get interested in
addressing these issues is the once a year when one of these reports
come out; otherwise, we seem to have other priorities when messing in
hbase code base.

St.Ack

Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Stack <st...@duboce.net>.

On Thu, Aug 30, 2012 at 7:51 AM, Cristofer Weber
<cr...@neogrid.com> wrote:
> About HMasters, yes, it's not clear.
>
> In section 6.1 they say that "Since we focused on a setup with a maximum of 12 nodes, we did not assign the master node and jobtracker to separate nodes instead we deployed them with data nodes."
>
> But in section 4.1 they say that "The configuration was done using a dedicated node for the running master processes (NameNode and SecondaryNameNode), therefore for all the benchmarks the specified number of servers correspond to nodes running slave processes (DataNodes and TaskTrackers) as well as HBase’s region server processes."
>
> About configurations, the first paragraph on "6. EXPERIENCES" contains this: "In our initial test runs, we ran every system with the default configuration, and then tried to improve the performance by changing various tuning parameters. We dedicated at least a week for configuring and tuning each system (concentrating on one system at a time) to get a fair comparison."
>
> I agree that it would be nice to see this experiment with 0.94.1, but 0.90.4 was released a year ago, so I understand that this version was the official version when these experiments were conducted.
>

It's a bit tough going back in time fixing 0.90.4 results.  The
"...failed frequently in non-deterministic ways..." is an ugly mark to
have hanging over hbase in a paper like this that will probably be
around a while.  I wonder what the cause was (I don't think that is
typical of 0.90.4, IIRC).

On how to improve read performance, if it's not in here,
http://hbase.apache.org/book.html#performance, in the refguide, then
the tuning option might as well not exist (Anyone see anything
missing?).

We consistently do badly in these tests though our operational, actual
experience seems much better than what is shown in these benchmarks.
As has been said elsewhere on this thread, the takeaway is improved
defaults and auto-tuning but the only time we get interested in
addressing these issues is the once a year when one of these reports
comes out; otherwise, we seem to have other priorities when messing in
the hbase code base.

St.Ack

Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Cristofer Weber <cr...@neogrid.com>.

About HMasters, yes, it's not clear. 

In section 6.1 they say that "Since we focused on a setup with a maximum of 12 nodes, we did not assign the master node and jobtracker to separate nodes instead we deployed them with data nodes."

But in section 4.1 they say that "The configuration was done using a dedicated node for the running master processes (NameNode and SecondaryNameNode), therefore for all the benchmarks the specified number of servers correspond to nodes running slave processes (DataNodes and TaskTrackers) as well as HBase’s region server processes."

About configurations, the first paragraph on "6. EXPERIENCES" contains this: "In our initial test runs, we ran every system with the default configuration, and then tried to improve the performance by changing various tuning parameters. We dedicated at least a week for configuring and tuning each system (concentrating on one system at a time) to get a fair comparison."

I agree that it would be nice to see this experiment with 0.94.1, but 0.90.4 was released a year ago, so I understand that this version was the official version when these experiments were conducted.

Best regards,
Cristofer

Re: [maybe off-topic?] article: Solving Big Data Challenges for Enterprise Application Performance Management

Posted by Dave Wang <ds...@cloudera.com>.

My reading of the paper is that they are actually not clear about whether
or not HMasters were deployed on datanodes.

I'm going to guess that they just used default configurations for HBase and
YCSB, but the paper again is not specific enough.

Why were they using 0.90.4 in 2012?  Would have been nice to see some of
the more recent work done in the area of performance.

One thing the paper does touch on is the relative difficulty of standing up
the cluster, which has not changed since 0.90.4.  I think that's definitely
something that could be improved upon.

- Dave

On Thu, Aug 30, 2012 at 6:27 AM, Cristofer Weber <
cristofer.weber@neogrid.com> wrote:

> Just read this article, "Solving Big Data Challenges for Enterprise
> Application Performance Management." published this month @ Volume 5, No.12
> of Proceedings of the VLDB Endowment, where they measured 6 different
> databases - Project Voldemort, Redis, HBase, Cassandra, MySQL Cluster and
> VoltDB - with YCSB on two different kind of clusters, Memory-bound and
> Disk-bound,  and I'm in doubt about results for HBase since:
>
>
> *         HBase version was 0.90.4
>
> *         Master nodes were deployed together with data nodes
>
> *         They didn't reported tuning parameters
>
> There's also a paragraph where they reported that HBase failed frequently
> in non-deterministic ways while running YCSB.
>
> My intention with this e-mail is to look for opinions from you, who are
> more experienced with HBase, on where this experiment's setup could be
> changed to improve read operations, since in this setup HBase did not
> performed as well as Cassandra and Project Voldemort.
>
> Here's the article:
> http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf and Volume 5
> home: http://vldb.org/pvldb/vol5.html