
Posted to user@cassandra.apache.org by Tim Heckman <ti...@pagerduty.com> on 2014/09/08 20:08:10 UTC

"Failed to enable shuffling" error

Hello,

I'm looking to convert our recently upgraded Cassandra cluster from a
single token per node to using vnodes. We've determined, based on
our data consistency and usage patterns, that shuffling will be the
best way to convert our live cluster.

However, when following the instructions for doing the shuffle, we
aren't able to enable shuffling on the other 4 nodes in the cluster.
We get the error message 'Failed to enable shuffling', which looks to
be a generic string printed when a JMX IOException is caught.
Unfortunately, the underlying error is not printed so I'm effectively
troubleshooting in the dark.

I've done some mailing list diving, as well as Google skimming, and
none of the suggestions seemed to work.

I've confirmed that a firewall is not the cause as I am able to
establish a TCP socket (using telnet) from one node to the other. I've
also double-checked the JMX-specific settings that are being set for
Cassandra and those look good. I'm going with the most open settings
now to try and get this working:

-Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false

I also tried playing with the 'java.rmi.server.hostname' setting, but
none of the options set seemed to make a difference (hostname, fqdn,
public IPv4 address, private EC2 address).
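
A note on why a successful telnet to 7199 may not be conclusive: with
the standard RMI-based JMX connector, the jmxremote.port serves the RMI
registry, and the client then opens a second connection to whatever host
and port the registry's stub advertises. The host comes from
java.rmi.server.hostname, and the port is typically chosen at random
unless it is pinned. The sketch below is a minimal, hypothetical probe
(the class name JmxProbe and its helper methods are illustrative and not
part of Cassandra). It attempts the same kind of JMX connection through
the standard javax.management.remote API and prints every exception in
the cause chain, which is the detail that 'cassandra-shuffle' hides. It
assumes unauthenticated, non-SSL JMX as in the settings above.

```java
import java.io.IOException;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical diagnostic helper; not part of Cassandra or its tools.
public class JmxProbe {
    // Builds the conventional RMI-based JMX service URL for a host and registry port.
    public static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Renders an exception and each of its causes, one per line.
    public static String causeChain(Throwable t) {
        StringBuilder sb = new StringBuilder();
        for (Throwable c = t; c != null; c = c.getCause()) {
            sb.append(c.getClass().getName())
              .append(": ")
              .append(c.getMessage())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Defaults are assumptions: local host and the port from the settings above.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 7199;
        try (JMXConnector connector =
                 JMXConnectorFactory.connect(new JMXServiceURL(serviceUrl(host, port)))) {
            // Any remote call proves the second (RMI server) connection works too.
            System.out.println("Connected, MBean count: "
                + connector.getMBeanServerConnection().getMBeanCount());
        } catch (IOException e) {
            // Print the full chain instead of a generic failure string.
            System.err.print(causeChain(e));
        }
    }
}
```

Run it from the node where 'cassandra-shuffle' fails, for example
`java JmxProbe <other-node> 7199`. If the chain ends in a refused or
timed-out connection that names an address or port other than the one
passed in, that would point at the advertised RMI endpoint rather than
the registry port.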

Without any further information from the 'cassandra-shuffle' utility
I'm pretty much out of ideas. Any suggestions would be greatly
appreciated!

Cheers!
-Tim

Re: "Failed to enable shuffling" error

Posted by Jonathan Haddad <jo...@jonhaddad.com>.

Thrift is still present in the 2.0 branch as well as 2.1.  Where did
you see that it's deprecated?

Let me elaborate on my earlier advice.  Shuffle was removed because it
doesn't work for anything beyond a trivial dataset.  It is definitely
"more risky" than adding a new vnode enabled DC, as it does not work
at all.

On Mon, Sep 8, 2014 at 2:01 PM, Tim Heckman <ti...@pagerduty.com> wrote:
> On Mon, Sep 8, 2014 at 1:45 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:
>> I believe shuffle has been removed recently.  I do not recommend using
>> it for any reason.
>
> We're still using the 1.2.x branch of Cassandra, and will be for some
> time due to the thrift deprecation. Has it only been removed from the
> 2.x line?
>
>> If you really want to go vnodes, your only sane option is to add a new
>> DC that uses vnodes and switch to it.
>
> We use the NetworkTopologyStrategy across three geographically
> separated regions. Doing it this way feels a bit more risky based on
> our replication strategy. Also, I'm not sure where all we have our
> current datacenter names defined across our different internal
> repositories. So there could be quite a large number of changes going
> this route.
>
>> The downside in the 2.0.x branch to using vnodes is that repairs take
>> N times as long, where N is the number of tokens you put on each node.
>> I can't think of any other reasons why you wouldn't want to use vnodes
>> (but this may be significant enough for you by itself)
>>
>> 2.1 should address the repair issue for most use cases.
>>
>> Jon
>
> Thank you for the notes on the behaviors in the 2.x branch. If we do
> move to the 2.x version that's something we'll be keeping in mind.
>
> Cheers!
> -Tim

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Re: "Failed to enable shuffling" error

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Sep 8, 2014 at 2:01 PM, Tim Heckman <ti...@pagerduty.com> wrote:

> We're still using the 1.2.x branch of Cassandra, and will be for some
> time due to the thrift deprecation. Has it only been removed from the
> 2.x line?
>

Other than the fact that 2.0.x is not production ready yet, there's no
reason not to go to newer versions. Thrift is deprecated and unmaintained
in versions above 2.0 but is unlikely to be actually removed from the
codebase for at least another 3 or 4 major versions. In fact, the official
statement is that there are no plans to remove it; let us hope that's not
true.

=Rob

Re: "Failed to enable shuffling" error

Posted by Tim Heckman <ti...@pagerduty.com>.

On Mon, Sep 8, 2014 at 1:45 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:
> I believe shuffle has been removed recently.  I do not recommend using
> it for any reason.

We're still using the 1.2.x branch of Cassandra, and will be for some
time due to the Thrift deprecation. Has shuffle only been removed from
the 2.x line?

> If you really want to go vnodes, your only sane option is to add a new
> DC that uses vnodes and switch to it.

We use the NetworkTopologyStrategy across three geographically
separated regions. Doing it this way feels a bit more risky based on
our replication strategy. Also, I'm not sure of all the places where
our current datacenter names are defined across our internal
repositories, so this route could require quite a large number of
changes.

> The downside in the 2.0.x branch to using vnodes is that repairs take
> N times as long, where N is the number of tokens you put on each node.
> I can't think of any other reasons why you wouldn't want to use vnodes
> (but this may be significant enough for you by itself)
>
> 2.1 should address the repair issue for most use cases.
>
> Jon

Thank you for the notes on the behaviors in the 2.x branch. If we do
move to the 2.x version that's something we'll be keeping in mind.

Cheers!
-Tim

Re: "Failed to enable shuffling" error

Posted by Jonathan Haddad <jo...@jonhaddad.com>.

I believe shuffle has been removed recently.  I do not recommend using
it for any reason.

If you really want to go vnodes, your only sane option is to add a new
DC that uses vnodes and switch to it.

The downside in the 2.0.x branch to using vnodes is that repairs take
N times as long, where N is the number of tokens you put on each node.
I can't think of any other reasons why you wouldn't want to use vnodes
(but this may be significant enough for you by itself)

2.1 should address the repair issue for most use cases.
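
To put rough numbers on the points above, here is a small sketch that
takes the linear "N times as long" statement at face value. The class
name VnodeCost, the num_tokens value of 256, and the half-hour baseline
are illustrative assumptions, not measurements from any cluster.

```java
// Hypothetical back-of-the-envelope helper; the model is the linear one stated above.
public class VnodeCost {
    // Total token ranges in the ring when every node owns numTokens tokens.
    public static int totalRanges(int nodes, int numTokens) {
        return nodes * numTokens;
    }

    // Repair time if it scales linearly with the number of tokens per node.
    public static double repairHours(double singleTokenHours, int numTokens) {
        return singleTokenHours * numTokens;
    }

    public static void main(String[] args) {
        int nodes = 5;            // cluster size mentioned in this thread
        int numTokens = 256;      // assumption: an example num_tokens setting
        double baseHours = 0.5;   // assumption: hypothetical single-token repair time
        System.out.println("Ranges with single tokens: " + totalRanges(nodes, 1));
        System.out.println("Ranges with vnodes:        " + totalRanges(nodes, numTokens));
        System.out.println("Estimated repair hours:    " + repairHours(baseHours, numTokens));
    }
}
```

The real multiplier depends on the Cassandra version and how repair is
run, so treat the result as an upper-bound illustration of the concern
rather than a prediction.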

Jon


-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Re: "Failed to enable shuffling" error

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Sep 8, 2014 at 1:21 PM, Tim Heckman <ti...@pagerduty.com> wrote:

> We're still at the exploratory stage on systems that are not
> production-facing but contain production-like data. Based on our
> placement strategy we have some concerns that the new datacenter
> approach may be riskier or more difficult. We're just trying to gauge
> both paths and see what works best for us.

Your case of RF=N is probably the best possible case for shuffle, but
my general statements about how little this code has been exercised
still stand. :)

> The cluster I'm testing this on is a 5 node cluster with a placement
> strategy such that all nodes contain 100% of the data. In practice we
> have six clusters of similar size that are used for different
> services. These different clusters may need additional capacity at
> different times, so it's hard to answer the maximum size question. For
> now let's just assume that the clusters may never see an 11th
> member... but no guarantees.
>

With an RF of 3, clusters of fewer than approximately 10 nodes tend to net
lose from vnodes. If these clusters are not very likely to ever have more
than 10 nodes, consider not using vnodes.

> We're looking to use vnodes to ease the administrative work of
> scaling out the cluster, and for the improvements to streaming data
> during repairs, amongst others.
>

Most of these wins don't occur until you have a lot of nodes, but the fixed
costs of having many ranges are paid all the time.

> For shuffle, it looks like it may be easier than adding a new
> datacenter and then having to adjust the schema for a new "datacenter"
> to come to life. And we weren't sure whether the same pitfalls of
> shuffle would affect us while having all data on all nodes.
>

Let us know! Good luck!

=Rob

Re: "Failed to enable shuffling" error

Posted by Tim Heckman <ti...@pagerduty.com>.

On Mon, Sep 8, 2014 at 11:19 AM, Robert Coli <rc...@eventbrite.com> wrote:
> On Mon, Sep 8, 2014 at 11:08 AM, Tim Heckman <ti...@pagerduty.com> wrote:
>>
>> I'm looking to convert our recently upgraded Cassandra cluster from a
>> single token per node to using vnodes. We've determined, based on
>> our data consistency and usage patterns, that shuffling will be the
>> best way to convert our live cluster.
>
>
> You apparently haven't read anything else about shuffling, or you would have
> learned that no one has ever successfully done it in a real production
> cluster. ;)

I've definitely seen the horror stories that have come out of shuffle.
:) We plan on giving this a trial run on production-sized data before
actually doing it on our production hardware.

>>
>> Unfortunately, the underlying error is not printed so I'm effectively
>> troubleshooting in the dark.
>
>
> This mysterious error is protecting you from a probably quite negative
> experience with shuffle.

We're still at the exploratory stage on systems that are not
production-facing but contain production-like data. Based on our
placement strategy we have some concerns that the new datacenter
approach may be riskier or more difficult. We're just trying to gauge
both paths and see what works best for us.

>>
>> I've done some mailing list diving, as well as Google skimming, and
>> none of the suggestions seemed to work.
>
>
> What version of Cassandra are you running? I would not be surprised if
> shuffle is in fact completely broken in the 2.0.x releases, not only
> hazardous to attempt.
>
> Why do you believe that you want to shuffle and/or enable vnodes? How large
> is the cluster and how large is it likely to become?

We're still back on the 1.2 version of Cass, specifically 1.2.16 for
the majority of our clusters with one cluster having seen its
inception after the 1.2.18 release.

The cluster I'm testing this on is a 5 node cluster with a placement
strategy such that all nodes contain 100% of the data. In practice we
have six clusters of similar size that are used for different
services. These different clusters may need additional capacity at
different times, so it's hard to answer the maximum size question. For
now let's just assume that the clusters may never see an 11th
member... but no guarantees.

We're looking to use vnodes to ease the administrative work of
scaling out the cluster, and for the improvements to streaming data
during repairs, amongst others.

For shuffle, it looks like it may be easier than adding a new
datacenter and then having to adjust the schema for a new "datacenter"
to come to life. And we weren't sure whether the same pitfalls of
shuffle would affect us while having all data on all nodes.

> =Rob
>

Thanks for the quick reply, Rob.

-Tim

Re: "Failed to enable shuffling" error

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Sep 8, 2014 at 11:08 AM, Tim Heckman <ti...@pagerduty.com> wrote:

> I'm looking to convert our recently upgraded Cassandra cluster from a
> single token per node to using vnodes. We've determined, based on
> our data consistency and usage patterns, that shuffling will be the
> best way to convert our live cluster.
>

You apparently haven't read anything else about shuffling, or you would
have learned that no one has ever successfully done it in a real production
cluster. ;)

> Unfortunately, the underlying error is not printed so I'm effectively
> troubleshooting in the dark.
>

This mysterious error is protecting you from a probably quite negative
experience with shuffle.

> I've done some mailing list diving, as well as Google skimming, and
> none of the suggestions seemed to work.
>

What version of Cassandra are you running? I would not be surprised if
shuffle is in fact completely broken in the 2.0.x releases, not only
hazardous to attempt.

Why do you believe that you want to shuffle and/or enable vnodes? How large
is the cluster and how large is it likely to become?

=Rob