Posted to dev@cassandra.apache.org by Tanakorn Leesatapornwongsa <ta...@cs.uchicago.edu> on 2016/04/08 01:18:46 UTC

Research on scalability bug finder for Cassandra

Dear Cassandra development team,

We are computer science researchers at the University of Chicago. Our research is about the reliability of cloud-scale distributed systems. Samples of our work can be found here: http://ucare.cs.uchicago.edu

We are reaching out to you because we are interested in reproducing any unsolved scalability bugs in Cassandra.

We define scalability bugs as latent bugs that are scale-dependent. They don't arise in small-scale deployments but do arise in large-scale production runs. For example, everything is fine in a 100-node deployment, but in a 500-node deployment the bug appears.

We have created a scale-check methodology (SLCK) that can unearth scalability bugs on a single machine. With SLCK, we can run hundreds of nodes on a single machine and reproduce some old scalability bugs. For example, we have reproduced the following bugs on one machine:

- https://issues.apache.org/jira/browse/CASSANDRA-6127 (a customer observed node flapping when bootstrapping 1000 nodes)

- https://issues.apache.org/jira/browse/CASSANDRA-3831

We are submitting SLCK for publication soon, and we can send you a draft a month from now if you are interested.

To make a stronger publication submission, beyond reproducing old bugs, we thought it would be great if SLCK could reproduce new scalability bugs (if any) that you are still trying to resolve.

We hope you find our work interesting, and we would really appreciate it if you could point us to any new scalability bugs that we could help you reproduce.

Thank you very much for your attention!

Best,
Tanakorn L.
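
The "fine at 100 nodes, broken at 500 nodes" definition in the message above can be made concrete with a toy cost model. This is only an illustrative sketch: the quadratic cost, the per-pair cost of 0.05 microseconds, and the 10 ms budget are all hypothetical numbers chosen for the example, not measurements from Cassandra or from SLCK.

```python
def processing_time_ms(nodes, cost_per_pair_us=0.05):
    """Hypothetical time for one cluster-state recomputation.

    Assumes (for illustration only) that the work compares every pair of
    nodes, so the cost grows quadratically with cluster size.
    """
    return nodes * nodes * cost_per_pair_us / 1000.0


def first_failing_scale(budget_ms, scales):
    """Return the smallest cluster size whose cost exceeds the time budget.

    Returns None if every tested scale stays within the budget.
    """
    for n in sorted(scales):
        if processing_time_ms(n) > budget_ms:
            return n
    return None


if __name__ == "__main__":
    # With a (hypothetical) 10 ms budget, small clusters stay within the
    # budget and the problem only shows up at a larger scale.
    for n in (100, 200, 500, 1000):
        print(n, "nodes:", round(processing_time_ms(n), 2), "ms")
    print("first failing scale:", first_failing_scale(10.0, [100, 200, 500, 1000]))
```

Under these assumed numbers the code path looks healthy in any small test cluster, which is why such bugs tend to stay latent until a large production deployment.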

Re: Research on scalability bug finder for Cassandra

Posted by Jason Brown <ja...@gmail.com>.

+1 to Tupshin's proposal: 10k nodes (massive clusters) really is the next
frontier.

I don't expect vnodes to add that much to the gossip dissemination, as
the per-node tokens are sent out only a handful of times (mostly when a
node joins the ring). Without having hard data to back myself up, I'd
suspect that the Failure Detector fails first: with clusters that large, we
probably can't propagate the heartbeats (via gossip) fast enough or
regularly enough, so peers start marking each other down (due to the Phi
Accrual algorithm).

Thanks,

-Jason
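
To illustrate the failure-detector concern raised above, here is a minimal sketch of the phi accrual idea. It is a simplified model, not Cassandra's actual code: it assumes heartbeat inter-arrival times follow an exponential distribution, and the class name, window size, and threshold of 8.0 are assumptions made for this example.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual failure detector for one monitored peer."""

    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold             # suspicion level that marks a peer down
        self.intervals = deque(maxlen=window)  # recent heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: -log10 of the chance the peer is merely late.

        Under the exponential assumption, P(silence longer than t) is
        exp(-t / mean), so phi = t / (mean * ln 10).
        """
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        return elapsed / (mean * math.log(10))

    def is_suspect(self, now):
        return self.phi(now) > self.threshold


if __name__ == "__main__":
    det = PhiAccrualDetector()
    for t in range(11):          # heartbeats once per second, t = 0..10
        det.heartbeat(float(t))
    print(det.is_suspect(11.0))  # one second of silence
    print(det.is_suspect(30.0))  # twenty seconds of silence
```

In this model, phi grows linearly with the silence since the last heartbeat, scaled by the mean interval seen so far. If gossip in a very large cluster delivers heartbeats later than the history suggests, phi can cross the threshold for a peer that is actually healthy, which is the flapping scenario described in the message above.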

Re: Research on scalability bug finder for Cassandra

Posted by tu...@tupshin.com.

Hi Haryadi,

Personally I'd love to see your approach extended to test up to 10K
nodes, or so.

There are not too many known instances of scaling past 1000 nodes, and
as the need for scale grows, and as scale-out hardware becomes more
commonplace (high density, but with lots of small servers, e.g. HP
Moonshot, blade servers, etc.), 10K nodes is the next frontier. It would be
great to demonstrate that your tool can find *new* bugs and limitations
(which it certainly would at that scale), as opposed to just reproducing
existing ones.

One other thought is to test with both non-vnodes and vnodes (and maybe
several different numbers of vnodes per node) at extreme scales like that,
to get a sense of what kind of overhead vnodes add to the current gossip
implementation at scale.

Regarding existing bugs that you might usefully reproduce, I'll leave
that to others.

Thanks.

-Tupshin

Re: Research on scalability bug finder for Cassandra

Posted by Haryadi Gunawi <ha...@cs.uchicago.edu>.

Hi Jonathan,

Thanks for the reply!

We don't need a patched version of Cassandra. Specifically, this is the
help we'd like to get from you, if possible:

Cassandra devs: "Here are recent JIRA entries that discuss scale-dependent
bugs: CASSANDRA-X, -Y, -Z (where X, Y, and Z are JIRA bug numbers)"

Our side: We will study the bug discussions, download the affected
Cassandra version (as mentioned in the JIRA), integrate that specific
version with our framework, and reproduce the bug on one machine.

Basically, we're interested to know if there are still unresolved or
newly resolved bugs (2015-2016) in the Cassandra JIRA that we could use to
test our approach. (The bugs in our previous email are relatively old.)


We're targeting a publication deadline one month from now.  It'd be lovely
if we get more sample bugs.  After the deadline, we'd be happy to send you
the draft of the paper.

Please do let us know if you have any other questions.
Thanks!
-- Har

Re: Research on scalability bug finder for Cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.

Sounds very interesting!  We'd love to hear more about your approach.  In
particular, does it require a patched version of Cassandra?

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced