Posted to dev@cassandra.apache.org by Ryan King <ry...@twitter.com> on 2010/02/22 20:53:04 UTC

Re: thinking about dropping hinted handoff

On Wed, Jan 27, 2010 at 12:02 PM, Ryan King <ry...@twitter.com> wrote:
> On Wed, Jan 27, 2010 at 11:49 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>> On Wed, Jan 27, 2010 at 1:48 PM, Stu Hood <st...@rackspace.com> wrote:
>>>> The HH code currently tries to send the hints to nodes other than the
>>>> natural endpoints. If small-scale performance is a problem, we could
>>>> make the natural endpoints be responsible for the hints. This reduces
>>>> durability a bit, but might be a decent tradeoff.
>>> The other interesting benefit is that the hint would not need to store the actual content of the change, since the natural endpoints will already be storing copies. The hints would just need to encode the fact that a given (key,name1[,name2]) changed.
>>
>> Right, I think that's what Ryan was getting at.
>
> Indeed. Like I said, this change only helps with the "surprising
> effects on a small cluster" problem, but if that's enough, perhaps we
> should do it.

So, after having some more experience with HH, I've reformed my
opinion. I think we have 3 options:

1. Make the natural endpoints responsible for the hints.
2. Make a random node responsible for hints.
3. Get rid of HH.

#1 reduces the "surprising effects in a small cluster" problem by
adding a marginal amount of resource demands to nodes that already
have the data we need.
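
To make #1 concrete, here's a rough sketch of the shape I have in mind
(hypothetical names and types, not the actual Cassandra code): because a
surviving natural endpoint already stores the row, its hint only needs
to record that a key changed while a replica was down, not the mutation
itself.

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of option #1: hints live on a surviving natural
// endpoint. That endpoint already stores the row, so a hint records
// only (table, key, down replica); on recovery it re-reads its local
// copy of the key and streams it to the revived node.
class NaturalEndpointHints {
    private final Set<String> pendingHints = new HashSet<>();

    void recordHint(String table, String key, String downReplica) {
        pendingHints.add(table + "/" + key + "->" + downReplica);
    }

    void deliverHints(String revivedReplica) {
        for (String hint : pendingHints) {
            if (hint.endsWith("->" + revivedReplica)) {
                // Re-read the locally stored row for this key and send
                // it to the revived replica (RPC elided in this sketch).
            }
        }
    }
}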

#2 will spread the load out. We had a node die last week and decided
to leave it down so that we could learn about the effects of this
situation. We eventually ended up killing the next node on the ring
with all the hints (I think there are some improvements to this in 0.6,
but I don't know if they'll be enough). So, even on a large cluster
(ours is currently 45 nodes), HH can have surprising effects on nodes
that neighbor a node that's down. Picking either a random node or
using the coordinator node for the hint would spread the load out.

#3 is, I think, the right answer. It makes our system simpler and it
makes the behavior in failure conditions more predictable and safe.

What do you all think?

-ryan

Re: thinking about dropping hinted handoff

Posted by Ryan King <ry...@twitter.com>.
On Mon, Feb 22, 2010 at 5:01 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 6:57 PM, Ryan King <ry...@twitter.com> wrote:
>> I think I find it more compelling because we're currently experiencing
>> pain related to HH. I'd be OK with keeping it as long as we can make
>> the effects of a down node less drastic.
>
> Can you open a ticket and tag it 0.6?  I think I can implement #1
> easily.  If I am wrong I will push to 0.7.

Done https://issues.apache.org/jira/browse/CASSANDRA-822

-ryan

Re: thinking about dropping hinted handoff

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Feb 22, 2010 at 6:57 PM, Ryan King <ry...@twitter.com> wrote:
> I think I find it more compelling because we're currently experiencing
> pain related to HH. I'd be OK with keeping it as long as we can make
> the effects of a down node less drastic.

Can you open a ticket and tag it 0.6?  I think I can implement #1
easily.  If I am wrong I will push to 0.7.

-Jonathan

Re: thinking about dropping hinted handoff

Posted by Ryan King <ry...@twitter.com>.
On Mon, Feb 22, 2010 at 2:05 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 1:53 PM, Ryan King <ry...@twitter.com> wrote:
>> So, after having some more experience with HH, I've reformed my
>> opinion. I think we have 3 options:
>>
>> 1. Make the natural endpoints responsible for the hints.
>> 2. Make a random node responsible for hints.
>> 3. Get rid of HH.
>>
>> #1 reduces the "surprising effects in a small cluster" problem by
>> adding a marginal amount of resource demands to nodes that already
>> have the data we need.
>>
>> #2 will spread the load out. We had a node die last week and decided
>> to leave it down so that we could learn about the effects of this
>> situation. We eventually ended up killing the next node on the ring
>> with all the hints (I think there are some improvements to this in 0.6,
>> but I don't know if they'll be enough). So, even on a large cluster
>> (ours is currently 45 nodes), HH can have surprising effects on nodes
>> that neighbor a node that's down. Picking either a random node or
>> using the coordinator node for the hint would spread the load out.
>>
>> #3 is, I think, the right answer. It makes our system simpler and it
>> makes the behavior in failure conditions more predictable and safe.
>
> This is a good summary of the options.
>
> Why do you find 3 more compelling than 1?  Yes, it's simpler, but 1
> would not require a large change to the existing code, so perhaps we
> need a better case than that to justify removing a feature that
> already (mostly) works.

I think I find it more compelling because we're currently experiencing
pain related to HH. I'd be OK with keeping it as long as we can make
the effects of a down node less drastic.

-ryan

Re: thinking about dropping hinted handoff

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Feb 22, 2010 at 1:53 PM, Ryan King <ry...@twitter.com> wrote:
> So, after having some more experience with HH, I've reformed my
> opinion. I think we have 3 options:
>
> 1. Make the natural endpoints responsible for the hints.
> 2. Make a random node responsible for hints.
> 3. Get rid of HH.
>
> #1 reduces the "surprising effects in a small cluster" problem by
> adding a marginal amount of resource demands to nodes that already
> have the data we need.
>
> #2 will spread the load out. We had a node die last week and decided
> to leave it down so that we could learn about the effects of this
> situation. We eventually ended up killing the next node on the ring
> with all the hints (I think there are some improvements to this in 0.6,
> but I don't know if they'll be enough). So, even on a large cluster
> (ours is currently 45 nodes), HH can have surprising effects on nodes
> that neighbor a node that's down. Picking either a random node or
> using the coordinator node for the hint would spread the load out.
>
> #3 is, I think, the right answer. It makes our system simpler and it
> makes the behavior in failure conditions more predictable and safe.

This is a good summary of the options.

Why do you find 3 more compelling than 1?  Yes, it's simpler, but 1
would not require a large change to the existing code, so perhaps we
need a better case than that to justify removing a feature that
already (mostly) works.

-Jonathan

Re: thinking about dropping hinted handoff

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Feb 22, 2010 at 6:56 PM, Ryan King <ry...@twitter.com> wrote:
> Maybe I misread the code, but I thought it was triggered on every compaction.

Full compactions is correct; at least, that is the intended design. :)

-Jonathan

Re: thinking about dropping hinted handoff

Posted by Ryan King <ry...@twitter.com>.
2010/2/22 Peter Schüller <sc...@spotify.com>:
>> #3 is, I think, the right answer. It makes our system simpler and it
>> makes the behavior in failure conditions more predictable and safe.
>
> Any thoughts on time-to-self-heal? My impression browsing the code,
> and it seems to be confirmed by some wiki material, is that
> anti-entropy is triggered only during full compactions. While hinted
> handoff is never a guarantee, doing without it completely probably
> increases the urgency of anti-entropy.

Maybe I misread the code, but I thought it was triggered on every compaction.

-ryan

> In general, what are people's thoughts on the appropriate mechanism to
> gain confidence that the cluster as a whole is reasonably consistent?
> In particular in relation to performing maintenance that may require
> popping nodes in and out in some kind of rolling fashion. Are full
> compactions expected to be something you would want to trigger
> semi-regularly on production clusters by hand?
>
> --
> / Peter Schuller aka scode
>

Re: consistent backups

Posted by Jonathan Ellis <jb...@gmail.com>.
Go ahead.

2010/2/25 Ted Zlatanov <tz...@lifelogs.com>:
> On Thu, 25 Feb 2010 08:22:38 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
>
> JE> 2010/2/25 Ted Zlatanov <tz...@lifelogs.com>:
>>> I want a consistent backup.
>
> JE> You can get an "eventually consistent backup" by flushing all nodes
> JE> and snapshotting; no individual node's backup is guaranteed to be
> JE> consistent but if you restore from that snapshot then clients will get
> JE> eventually consistent behavior as usual.
>
> JE> Other than that there is no such thing as a "consistent view of the
> JE> data" in the strict sense, except in the trivial case of writes with
> JE> CL.ALL.
>
> That makes perfect sense, thanks for explaining.  Can the explanation be
> part of the http://wiki.apache.org/cassandra/Operations section on
> backups?  I'll submit the edit if you want.
>
> Thanks
> Ted
>
>

Re: consistent backups

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Thu, 25 Feb 2010 08:22:38 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> 2010/2/25 Ted Zlatanov <tz...@lifelogs.com>:
>> I want a consistent backup.

JE> You can get an "eventually consistent backup" by flushing all nodes
JE> and snapshotting; no individual node's backup is guaranteed to be
JE> consistent but if you restore from that snapshot then clients will get
JE> eventually consistent behavior as usual.

JE> Other than that there is no such thing as a "consistent view of the
JE> data" in the strict sense, except in the trivial case of writes with
JE> CL.ALL.

That makes perfect sense, thanks for explaining.  Can the explanation be
part of the http://wiki.apache.org/cassandra/Operations section on
backups?  I'll submit the edit if you want.

Thanks
Ted


Re: consistent backups (was: thinking about dropping hinted handoff)

Posted by Jonathan Ellis <jb...@gmail.com>.
2010/2/25 Ted Zlatanov <tz...@lifelogs.com>:
> I want a consistent backup.

You can get an "eventually consistent backup" by flushing all nodes
and snapshotting; no individual node's backup is guaranteed to be
consistent but if you restore from that snapshot then clients will get
eventually consistent behavior as usual.

Other than that there is no such thing as a "consistent view of the
data" in the strict sense, except in the trivial case of writes with
CL.ALL.
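
As a rough sketch of that procedure (assuming a host list, and that your
nodetool version accepts these flush/snapshot arguments; verify against
your deployment before relying on it):

import java.io.IOException;
import java.util.List;

// Sketch of the flush-then-snapshot backup described above. Assumes
// `nodetool` is on the PATH; some versions may require a keyspace
// argument to flush, so check yours first.
public class EventuallyConsistentBackup {
    public static void main(String[] args) throws Exception {
        List<String> hosts = List.of("cass1", "cass2", "cass3"); // your nodes

        // Flush memtables on every node first, so the snapshots contain
        // everything that has been acknowledged to clients.
        for (String host : hosts) {
            run("nodetool", "-h", host, "flush");
        }
        // Then snapshot each node. No single snapshot is consistent on
        // its own, but restoring all of them gives clients the usual
        // eventually consistent behavior.
        for (String host : hosts) {
            run("nodetool", "-h", host, "snapshot");
        }
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + String.join(" ", cmd));
        }
    }
}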

-Jonathan

consistent backups (was: thinking about dropping hinted handoff)

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Tue, 23 Feb 2010 19:37:14 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote: 

TZ> You're probably right, though I don't see a better way and don't
TZ> understand what makes this one problematic.  I thought it would at least
TZ> let the cluster keep serving read requests while settling into a steady
TZ> state.

OK, let me restate the question, since I think I should be clear in what
I'm trying to accomplish.

I want a consistent backup.  Is following
http://wiki.apache.org/cassandra/Operations (section "Backups")
sufficient?  Is there a danger that uncommitted data will not be
captured?  If so, what can I do (avoiding, if possible, a cluster
shutdown) to get a consistent view of the data?

I'm asking on the developer list to keep the thread in the same place
but this is probably a user-level question.

Ted


Re: thinking about dropping hinted handoff

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Tue, 23 Feb 2010 16:44:29 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> 2010/2/23 Ted Zlatanov <tz...@lifelogs.com>:
JE> because in a masterless environment there is no way to tell "when it's over"
>> 
>> Would it work to use an external agent?  It can get the list of nodes,
>> make them all read-only, then wait until every node reports no write
>> activity through JMX.

JE> At that point I'd say you're deeply into "cure worse than the disease"
JE> territory :)

You're probably right, though I don't see a better way and don't
understand what makes this one problematic.  I thought it would at least
let the cluster keep serving read requests while settling into a steady
state.

Ted


Re: thinking about dropping hinted handoff

Posted by Jonathan Ellis <jb...@gmail.com>.
2010/2/23 Ted Zlatanov <tz...@lifelogs.com>:
> JE> because in a masterless environment there is no way to tell "when it's over"
>
> Would it work to use an external agent?  It can get the list of nodes,
> make them all read-only, then wait until every node reports no write
> activity through JMX.

At that point I'd say you're deeply into "cure worse than the disease"
territory :)

Re: thinking about dropping hinted handoff

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Tue, 23 Feb 2010 15:57:15 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> 2010/2/23 Ted Zlatanov <tz...@lifelogs.com>:
>> You're welcome.  I don't understand why it doesn't help reach
>> consistency, though.  If you turn all the nodes in a cluster read-only
>> at the API level, what can make them inconsistent besides inter-node
>> traffic and scheduled writes?  I'd assume that activity will die down
>> eventually; can Cassandra tell a monitoring agent through JMX when it is
>> over?

JE> because in a masterless environment there is no way to tell "when it's over"

Would it work to use an external agent?  It can get the list of nodes,
make them all read-only, then wait until every node reports no write
activity through JMX.
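
Roughly like this, say (a hypothetical sketch: the JMX port and the
MBean/attribute names are assumptions, so check what your Cassandra
version actually exposes):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Poll one node's write counter over JMX until it stops moving. Run
// against every node; when all counters hold still, the cluster has
// (for this heuristic) stopped taking writes.
public class QuiesceWatcher {
    public static void main(String[] args) throws Exception {
        String host = args[0];
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Assumed MBean exposing a monotonically increasing write count.
            ObjectName proxy = new ObjectName(
                    "org.apache.cassandra.service:type=StorageProxy");
            long prev = -1;
            while (true) {
                long writes = ((Number) mbs
                        .getAttribute(proxy, "WriteOperations")).longValue();
                if (writes == prev) break; // unchanged since last poll
                prev = writes;
                Thread.sleep(5000);
            }
            System.out.println(host + ": write activity has quiesced");
        } finally {
            jmxc.close();
        }
    }
}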

Ted


Re: thinking about dropping hinted handoff

Posted by Jonathan Ellis <jb...@gmail.com>.
2010/2/23 Ted Zlatanov <tz...@lifelogs.com>:
> You're welcome.  I don't understand why it doesn't help reach
> consistency, though.  If you turn all the nodes in a cluster read-only
> at the API level, what can make them inconsistent besides inter-node
> traffic and scheduled writes?  I'd assume that activity will die down
> eventually; can Cassandra tell a monitoring agent through JMX when it is
> over?

because in a masterless environment there is no way to tell "when it's over"

Re: thinking about dropping hinted handoff

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Tue, 23 Feb 2010 15:25:29 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> 2010/2/23 Ted Zlatanov <tz...@lifelogs.com>:
>>>> Can a Cassandra node be made read-only (as far as clients know)?
>> 
JE> no.
>> 
>> Is there value (for reaching consistency) in adding that functionality?

JE> No.

JE> Thanks for the easy questions today. :)

You're welcome.  I don't understand why it doesn't help reach
consistency, though.  If you turn all the nodes in a cluster read-only
at the API level, what can make them inconsistent besides inter-node
traffic and scheduled writes?  I'd assume that activity will die down
eventually; can Cassandra tell a monitoring agent through JMX when it is
over?

Ted


Re: thinking about dropping hinted handoff

Posted by Jonathan Ellis <jb...@gmail.com>.
2010/2/23 Ted Zlatanov <tz...@lifelogs.com>:
>>> Can a Cassandra node be made read-only (as far as clients know)?
>
> JE> no.
>
> Is there value (for reaching consistency) in adding that functionality?

No.

Thanks for the easy questions today. :)

-Jonathan

Re: thinking about dropping hinted handoff

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Tue, 23 Feb 2010 13:49:37 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> 2010/2/23 Ted Zlatanov <tz...@lifelogs.com>:
>> On Mon, 22 Feb 2010 21:12:58 +0100 Peter Schüller <sc...@spotify.com> wrote:
>> 
PS> In general, what are people's thoughts on the appropriate mechanism to
PS> gain confidence that the cluster as a whole is reasonably consistent?
PS> In particular in relation to performing maintenance that may require
PS> popping nodes in and out in some kind of rolling fashion.
>> 
>> Can a Cassandra node be made read-only (as far as clients know)?

JE> no.

Is there value (for reaching consistency) in adding that functionality?

Ted


Re: thinking about dropping hinted handoff

Posted by Jonathan Ellis <jb...@gmail.com>.
no.

2010/2/23 Ted Zlatanov <tz...@lifelogs.com>:
> On Mon, 22 Feb 2010 21:12:58 +0100 Peter Schüller <sc...@spotify.com> wrote:
>
> PS> In general, what are people's thoughts on the appropriate mechanism to
> PS> gain confidence that the cluster as a whole is reasonably consistent?
> PS> In particular in relation to performing maintenance that may require
> PS> popping nodes in and out in some kind of rolling fashion.
>
> Can a Cassandra node be made read-only (as far as clients know)?
>
> Ted
>
>

Re: thinking about dropping hinted handoff

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Wed, 10 Mar 2010 15:59:55 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> Read-only for a specific client is completely different from trying to
JE> make the entire node or cluster read-only.  So no, nothing wrong with that.

Cool, thanks.  See CASSANDRA-900 for my proposal.

Ted


Re: thinking about dropping hinted handoff

Posted by Jonathan Ellis <jb...@gmail.com>.
Read-only for a specific client is completely different from trying to
make the entire node or cluster read-only.  So no, nothing wrong with that.

2010/3/10 Ted Zlatanov <tz...@lifelogs.com>:
> On Fri, 26 Feb 2010 08:18:49 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote:
>
> TZ> On Tue, 23 Feb 2010 12:30:52 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote:
>> Can a Cassandra node be made read-only (as far as clients know)?
>
> TZ> I realized I have another use case for read-only access besides backups:
>
> TZ> On our network we have Cassandra readers, writers, and analyzers
> TZ> (read+write).  The writers and analyzers can run anywhere.  The readers
> TZ> can run anywhere too.  I don't want the readers to have write access but
> TZ> they should be able to read all keyspaces.
>
> TZ> I think the best way to solve this is with an IAuthenticator change to
> TZ> distinguish between full permissions and read-only permissions.  Then
> TZ> the Thrift API has to be modified to check for write access in only some
> TZ> functions:
> ...
> TZ> Does this seem reasonable?
>
> Any comments, while we're discussing authentication?  I think read-only
> access makes a lot of sense in this context.
>
> Ted
>
>

Re: Latest svn code

Posted by Jonathan Ellis <jb...@gmail.com>.
Both.

The latest 0.6 code is in the 0.6 branch.

The latest trunk code (will become 0.7) is in trunk.

Trunk is in "breaking stuff" mode right now.

On Wed, Mar 10, 2010 at 9:50 AM, David Dabbs <dm...@gmail.com> wrote:
> Hi. Is the latest code in trunk or the 0.6 branch?
>
> Thanks,
>
> david
>
>
>
>

Latest svn code

Posted by David Dabbs <dm...@gmail.com>.
Hi. Is the latest code in trunk or the 0.6 branch?

Thanks,

david




Re: thinking about dropping hinted handoff

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Fri, 26 Feb 2010 08:18:49 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote: 

TZ> On Tue, 23 Feb 2010 12:30:52 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote: 
> Can a Cassandra node be made read-only (as far as clients know)?

TZ> I realized I have another use case for read-only access besides backups:

TZ> On our network we have Cassandra readers, writers, and analyzers
TZ> (read+write).  The writers and analyzers can run anywhere.  The readers
TZ> can run anywhere too.  I don't want the readers to have write access but
TZ> they should be able to read all keyspaces.

TZ> I think the best way to solve this is with an IAuthenticator change to
TZ> distinguish between full permissions and read-only permissions.  Then
TZ> the Thrift API has to be modified to check for write access in only some
TZ> functions:
...
TZ> Does this seem reasonable?

Any comments, while we're discussing authentication?  I think read-only
access makes a lot of sense in this context.

Ted


Re: thinking about dropping hinted handoff

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Tue, 23 Feb 2010 12:30:52 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote: 

TZ> Can a Cassandra node be made read-only (as far as clients know)?

I realized I have another use case for read-only access besides backups:

On our network we have Cassandra readers, writers, and analyzers
(read+write).  The writers and analyzers can run anywhere.  The readers
can run anywhere too.  I don't want the readers to have write access but
they should be able to read all keyspaces.

I think the best way to solve this is with an IAuthenticator change to
distinguish between full permissions and read-only permissions.  Then
the Thrift API has to be modified to check for write access in only some
functions:

insert
batch_insert
remove
batch_mutate

I can make the necessary changes to the Avro API as well.

The work will require a change to login() to make it return an enum:

enum AuthorizedAccessLevel {
    NONE = 0,   // not logged in; no access granted
    READ = 16,  // read-only: may query but not mutate
    WRITE = 32, // full access, including mutations
}

AuthorizedAccessLevel login(1: required string keyspace,
                            2: required AuthenticationRequest auth_request)
    throws (1: AuthenticationException authnx, 2: AuthorizationException authzx),

...and that's pretty much it.  Since login() used to be void, the change
is painless and will basically be a change of the loginDone ThreadLocal
in CassandraServer from Boolean to AuthorizedAccessLevel.

I left room between the enum values in case we need future expansion,
e.g. "insert-only" for collectors that can't remove() or
batch_mutate().
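
For illustration, the server-side check might look roughly like this
(a hypothetical sketch, not the actual CassandraServer code):

// Sketch of the proposed change: loginDone goes from Boolean to the
// level granted at login(), kept per client connection thread.
enum AccessLevel { NONE, READ, WRITE }

class AccessCheckSketch {
    private static final ThreadLocal<AccessLevel> loginDone =
            ThreadLocal.withInitial(() -> AccessLevel.NONE);

    static void requireWriteAccess() {
        if (loginDone.get() != AccessLevel.WRITE) {
            throw new SecurityException("write access required");
        }
    }

    // e.g. at the top of insert/batch_insert/remove/batch_mutate:
    void insert(String keyspace, String key /* remaining args elided */) {
        requireWriteAccess();
        // ... apply the mutation ...
    }
}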

Does this seem reasonable?

Thanks
Ted


Re: thinking about dropping hinted handoff

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Mon, 22 Feb 2010 21:12:58 +0100 Peter Schüller <sc...@spotify.com> wrote: 

PS> In general, what are people's thoughts on the appropriate mechanism to
PS> gain confidence that the cluster as a whole is reasonably consistent?
PS> In particular in relation to performing maintenance that may require
PS> popping nodes in and out in some kind of rolling fashion.

Can a Cassandra node be made read-only (as far as clients know)?

Ted


Re: thinking about dropping hinted handoff

Posted by Peter Schüller <sc...@spotify.com>.
> #3 is, I think, the right answer. It makes our system simpler and it
> makes the behavior in failure conditions more predictable and safe.

Any thoughts on time-to-self-heal? My impression browsing the code,
and it seems to be confirmed by some wiki material, is that
anti-entropy is triggered only during full compactions. While hinted
handoff is never a guarantee, doing without it completely probably
increases the urgency of anti-entropy.

In general, what are people's thoughts on the appropriate mechanism to
gain confidence that the cluster as a whole is reasonably consistent?
In particular in relation to performing maintenance that may require
popping nodes in and out in some kind of rolling fashion. Are full
compactions expected to be something you would want to trigger
semi-regularly on production clusters by hand?

-- 
/ Peter Schuller aka scode