Posted to dev@cassandra.apache.org by Jeremy Hanna <je...@gmail.com> on 2018/10/16 14:10:35 UTC

Deprecating/removing PropertyFileSnitch?

We have had PropertyFileSnitch (PFS) for a long time even though GossipingPropertyFileSnitch (GPFS) is effectively a superset of what it offers and is much less error prone.  There are some unexpected behaviors when things aren’t configured correctly with PFS.  For example, if you replace nodes in one DC and add those nodes to that DC’s property files but not to the other DCs’ property files, the resulting problems aren’t very straightforward to troubleshoot.

We could try to improve the resilience, fail-fast error checking, and error reporting of PFS, but honestly, why wouldn’t we deprecate and remove PropertyFileSnitch?  Are there reasons why GPFS wouldn’t be sufficient to replace it?
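
To make the failure mode concrete, here is a minimal Python sketch of how a lookup against cassandra-topology.properties behaves when one DC's copy of the file is stale. It is a simplified model, not Cassandra's implementation. The addresses, DC/rack names, and the helpers `parse_topology` and `locate` are made up for illustration, and it assumes the usual `address=DC:RACK` line format with a `default=` entry.

```python
def parse_topology(text):
    """Parse 'address=DC:RACK' lines into a dict; skip blanks and '#' comments."""
    topology = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        address, _, location = line.partition("=")
        dc, _, rack = location.partition(":")
        topology[address.strip()] = (dc.strip(), rack.strip())
    return topology


def locate(topology, address):
    """Return (dc, rack) for an address, falling back to the 'default' entry."""
    if address in topology:
        return topology[address]
    return topology["default"]


# Hypothetical copy of the file on DC1 nodes, updated after a node replacement.
DC1_FILE = """
10.0.1.1=DC1:RAC1
10.0.1.9=DC1:RAC1
10.0.2.1=DC2:RAC1
default=DC2:RAC1
"""

# Hypothetical stale copy on DC2 nodes: the replacement node 10.0.1.9 is missing.
DC2_FILE = """
10.0.1.1=DC1:RAC1
10.0.1.8=DC1:RAC1
10.0.2.1=DC2:RAC1
default=DC2:RAC1
"""

view_from_dc1 = locate(parse_topology(DC1_FILE), "10.0.1.9")
view_from_dc2 = locate(parse_topology(DC2_FILE), "10.0.1.9")

print("DC1 nodes place 10.0.1.9 in:", view_from_dc1)
print("DC2 nodes place 10.0.1.9 in:", view_from_dc2)
```

In this model the two DCs silently disagree about where the same node lives, and nothing fails at startup to point at the stale file.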
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Re: Deprecating/removing PropertyFileSnitch?

Posted by Sankalp Kohli <ko...@gmail.com>.
+1 to fallback, and like I said before, removing PFS is a good idea as long as it is safe.

> On Oct 22, 2018, at 7:41 PM, Jeff Jirsa <jj...@gmail.com> wrote:
> 
> On Mon, Oct 22, 2018 at 7:09 PM J. D. Jordan <je...@gmail.com>
> wrote:
> 
>> Do you have a specific gossip bug that you have seen recently which caused
>> a problem that would make this happen?  Do you have a specific JIRA in mind?
> 
> 
> Sankalp linked a few others, but also
> https://issues.apache.org/jira/browse/CASSANDRA-13700
> 
> 
>>  “We can’t remove this because what if there is a bug” doesn’t seem like
>> a good enough reason to me. If that was a reason we would never make any
>> changes to anything.
>> 
> 
> How about "we know that certain fields that are gossiped go missing even
> after all of the known races are fixed, so removing an existing
> low-maintenance feature and forcing users to rely on gossip for topology
> may be worth some discussion".
> 
> 
>> I think many people have seen PFS actually cause real problems, where with
>> GPFS the issue being talked about is predicated on some theoretical gossip
>> bug happening.
>> 
> 
> How many of those were actually caused by incorrect fallback from GPFS to
> PFS, rather than PFS itself?
> 
> 
>> In the past year at DataStax we have done a lot of testing on 3.0 and 3.11
>> around adding nodes, adding DC’s, replacing nodes, replacing racks, and
>> replacing DC’s, all while using GPFS, and as far as I know we have not seen
>> any “lost” rack/DC information during such testing.
>> 
> 
> I've also run very large GPFS clusters in the past without much gossip
> pain, and I'm in the "we should deprecate PFS" camp, but it is also true
> that PFS is low maintenance and mostly works. Perhaps the first step is
> breaking the GPFS->PFS fallback that people don't know about, maybe that'll
> help?



Re: Deprecating/removing PropertyFileSnitch?

Posted by Jeff Jirsa <jj...@gmail.com>.
On Mon, Oct 22, 2018 at 7:09 PM J. D. Jordan <je...@gmail.com>
wrote:

> Do you have a specific gossip bug that you have seen recently which caused
> a problem that would make this happen?  Do you have a specific JIRA in mind?


Sankalp linked a few others, but also
https://issues.apache.org/jira/browse/CASSANDRA-13700


>   “We can’t remove this because what if there is a bug” doesn’t seem like
> a good enough reason to me. If that was a reason we would never make any
> changes to anything.
>

How about "we know that certain fields that are gossiped go missing even
after all of the known races are fixed, so removing an existing
low-maintenance feature and forcing users to rely on gossip for topology
may be worth some discussion".


> I think many people have seen PFS actually cause real problems, where with
> GPFS the issue being talked about is predicated on some theoretical gossip
> bug happening.
>

How many of those were actually caused by incorrect fallback from GPFS to
PFS, rather than PFS itself?


> In the past year at DataStax we have done a lot of testing on 3.0 and 3.11
> around adding nodes, adding DC’s, replacing nodes, replacing racks, and
> replacing DC’s, all while using GPFS, and as far as I know we have not seen
> any “lost” rack/DC information during such testing.
>

I've also run very large GPFS clusters in the past without much gossip
pain, and I'm in the "we should deprecate PFS" camp, but it is also true
that PFS is low maintenance and mostly works. Perhaps the first step is
breaking the GPFS->PFS fallback that people don't know about, maybe that'll
help?
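
One way to picture the fallback being discussed is the toy model below. It is only a sketch under assumptions: the lookup order (gossip first, then the topology file if one is present, then saved state) is a simplification for illustration, and `SnitchModel`, its fields, the `UNKNOWN` placeholder, and the `allow_file_fallback` switch are hypothetical names, not the actual GossipingPropertyFileSnitch code or any existing option.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Location = Tuple[str, str]  # (datacenter, rack)
UNKNOWN: Location = ("UNKNOWN_DC", "UNKNOWN_RACK")  # placeholder for "no information"


@dataclass
class SnitchModel:
    """Toy model of resolving a remote node's DC/rack (hypothetical)."""
    gossip: Dict[str, Location] = field(default_factory=dict)   # info learned via gossip
    saved: Dict[str, Location] = field(default_factory=dict)    # info persisted locally
    topology_file: Optional[Dict[str, Location]] = None         # parsed PFS-style file
    allow_file_fallback: bool = True                            # hypothetical switch

    def locate(self, address: str) -> Tuple[Location, str]:
        """Return the location and the name of the source that supplied it."""
        if address in self.gossip:
            return self.gossip[address], "gossip"
        if self.allow_file_fallback and self.topology_file is not None:
            default = self.topology_file.get("default", UNKNOWN)
            return self.topology_file.get(address, default), "topology file"
        if address in self.saved:
            return self.saved[address], "saved state"
        return UNKNOWN, "unknown"


# A stale topology file left on disk, plus correct saved state for the node.
stale_file = {"10.0.1.1": ("DC1", "RAC1"), "default": ("DC2", "RAC1")}
saved = {"10.0.1.9": ("DC1", "RAC1")}

with_fallback = SnitchModel(saved=saved, topology_file=stale_file)
without_fallback = SnitchModel(saved=saved, topology_file=stale_file,
                               allow_file_fallback=False)

print(with_fallback.locate("10.0.1.9"))
print(without_fallback.locate("10.0.1.9"))
```

In this model, a leftover topology file answers for a node whose gossip state is missing and hands back the file's default, while disabling the fallback lets the saved state answer instead.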

Re: Deprecating/removing PropertyFileSnitch?

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Hi,

I fully agree that PFS is way too dangerous and makes little (if any) sense
compared to GPFS.
We've had numerous customers that ended up with potential data loss and
fairly complex procedures to recover from several nodes jumping into the
default DC.
Misconfigurations also led to sudden changes of topology, which changed
token ownership and required a lot of knowledge to recover from (and even
then, with a reasonable level of uncertainty).
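
As an illustration of the kind of check that catches this before nodes land in the default DC, here is a small Python sketch that compares already-parsed topology files collected from several nodes and reports every address whose entry differs or is missing somewhere. The function name `diff_topologies`, the node names, and the sample data are hypothetical; this is not an existing Cassandra tool.

```python
def diff_topologies(files):
    """Compare per-node topology mappings.

    files: {node_name: {address: (dc, rack)}}
    Returns {address: {node_name: (dc, rack) or None}} for every address
    whose entry is not identical on all nodes (None means "not listed").
    """
    all_addresses = set()
    for topology in files.values():
        all_addresses.update(topology)

    mismatches = {}
    for address in sorted(all_addresses):
        seen = {node: topology.get(address) for node, topology in files.items()}
        if len(set(seen.values())) > 1:
            mismatches[address] = seen
    return mismatches


# Hypothetical data: node-b never received the entry for the new node 10.0.1.9.
files = {
    "node-a": {"10.0.1.9": ("DC1", "RAC1"), "default": ("DC1", "RAC1")},
    "node-b": {"default": ("DC1", "RAC1")},
}

for address, views in diff_topologies(files).items():
    print(address, views)
```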

+1 on removing PFS.

Cheers,



On Mon, Oct 29, 2018 at 6:20 PM Jeremy Hanna <je...@gmail.com>
wrote:

>
>
> > On Oct 29, 2018, at 11:20 AM, Jeff Jirsa <jj...@gmail.com> wrote:
> >
> > On Mon, Oct 29, 2018 at 8:35 AM Jeremy Hanna <jeremy.hanna1234@gmail.com>
> > wrote:
> >
> >> Re-reading this thread, it sounds like the issue is there are times when a
> >> field may go missing in gossip and it hasn’t yet been tracked down.  As
> >> Jeremiah says, can we get that into a Jira issue with any contextual
> >> information (if there is any)?  However as he says, in theory fields going
> >> missing from gossip shouldn’t cause problems for users of GPFS and I don’t
> >> believe there have been issues raised in that regard for all of the
> >> clusters out there (including Jeff’s comment about it in this thread).
> >> Testing that more thoroughly could also be a dependent ticket of
> >> deprecating/removing PFS.
> >>
> >>
> > The problem with opening a JIRA now is that it'll look just like 13700 and
> > the others before it - it'll read something like "status goes missing in
> > large clusters" and the very next time we find a gossip bug, we'll mark it
> > as fixed, and it may or may not be the only cause of that bug.
>
> I’ve created a Jira that CASSANDRA-10745 requires for completion to
> thoroughly test the GPFS under such conditions.  See CASSANDRA-14856 <
> https://issues.apache.org/jira/browse/CASSANDRA-14856>
> >
> >
> >> Separately, both Jeff and Sankalp were saying that the fallback was a
> >> problem and there was a flurry of tickets back in 2016 that led to the
> >> original ticket to deprecate the property file snitch.  However,
> >> https://issues.apache.org/jira/browse/CASSANDRA-10745 discusses what to
> >> do when deprecating it.  Would people want the functionality between GPFS
> >> completely separate from PFS or would people want a mode to emulate it
> >> while using the code for GPFS underneath?
> >>
> >
> > Actually, Jeff was guessing that the class of problems that would make you
> > want to deprecate PFS is fallback from GPFS to PFS (because beyond that PFS
> > is just stupid easy to use and I can't imagine it's causing a lot of
> > problems for people who know they're using PFS - yes, if you don't update
> > the file, things break, but that's precisely the guarantee of the snitch).
>
> My apologies if I had misrepresented, but I’m glad I checked.
>
> What I was originally saying is that PFS has these sharp edges to it - if
> you don’t sync the files for whatever reason, there are problems.  I saw a
> case recently where a team upgraded their machines in one DC and their
> addresses were new in that DC.  They updated the properties file in the DC
> where they upgraded machines but neglected to update the addresses in the
> other DC.  In that case, the nodes in the other DC saw nodes that didn’t
> have any configuration for them and assigned the default configuration as
> per the file option, which was incorrect.  That caused some difficult to
> workaround problems.  All of this could have been avoided had they been
> using the GPFS instead.
>
> So in order to not invite problems such as this for those new to the
> project or and just because there are going to be times when there will be
> configuration mismatches resulting in this sort of behavior (even with
> https://issues.apache.org/jira/browse/CASSANDRA-12681 <
> https://issues.apache.org/jira/browse/CASSANDRA-12681>), I was hoping to
> get consensus on deprecating/removing PFS.
>
> >
> >
> >>
> >>
> >>> On Oct 22, 2018, at 10:33 PM, Jeremiah D Jordan <
> >> jeremiah.jordan@gmail.com> wrote:
> >>>
> >>> If you guys are still seeing the problem, would be good to have a JIRA
> >> written up, as all the ones linked were fixed in 2017 and 2015.
> >> CASSANDRA-13700 was found during our testing, and we haven’t seen any
> other
> >> issues since fixing it.
> >>>
> >>> -Jeremiah
> >>>
> >>>> On Oct 22, 2018, at 10:12 PM, Sankalp Kohli <ko...@gmail.com>
> >> wrote:
> >>>>
> >>>> No worries...I mentioned the issue not the JIRA number
> >>>>
> >>>>> On Oct 22, 2018, at 8:01 PM, Jeremiah D Jordan <
> jeremiah@datastax.com>
> >> wrote:
> >>>>>
> >>>>> Sorry, maybe my spam filter got them or something, but I have never
> >> seen a JIRA number mentioned in the thread before this one.  Just looked
> >> back through again to make sure, and this is the first email I have with
> >> one.
> >>>>>
> >>>>> -Jeremiah
> >>>>>
> >>>>>> On Oct 22, 2018, at 9:37 PM, sankalp kohli <ko...@gmail.com>
> >> wrote:
> >>>>>>
> >>>>>> Here are some of the JIRAs which are fixed but actually did not fix
> >> the
> >>>>>> issue. We have tried fixing this by several patches. May be it will
> be
> >>>>>> fixed when Gossip is rewritten(CASSANDRA-12345). I should find or
> >> create a
> >>>>>> new JIRA as this issue still exists.
> >>>>>>
> >>>>>> https://issues.apache.org/jira/browse/CASSANDRA-10366
> >>>>>>
> >>>>>> https://issues.apache.org/jira/browse/CASSANDRA-10089 (related to it)
> >>>>>>
> >>>>>> Also the quote you are using was written as a follow on email. I
> have
> >>>>>> already said what the bug I was referring to.
> >>>>>>
> >>>>>> "Say you restarted all instances in the cluster and status for some
> >> host
> >>>>>> goes missing. Now when you start a host replacement, the new host
> >> won’t
> >>>>>> learn about the host whose status is missing and the view of this
> >> host will
> >>>>>> be wrong."
> >>>>>>
> >>>>>> - CASSANDRA-10366
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli <
> kohlisankalp@gmail.com
> >>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I will send the JIRAs of the bug which we thought we have fixed but
> >> it
> >>>>>>> still exists.
> >>>>>>>
> >>>>>>> Have you done any correctness testing after doing all these
> >> tests...have
> >>>>>>> you done the tests for 1000 instance clusters?
> >>>>>>>
> >>>>>>> It is great you have done these tests and I am hoping the gossiping
> >> snitch
> >>>>>>> is good. Also was there any Gossip bug fixed post 3.0? May be I am
> >> seeing
> >>>>>>> the bug which is fixed.
> >>>>>>>
> >>>>>>>> On Oct 22, 2018, at 7:09 PM, J. D. Jordan <
> >> jeremiah.jordan@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Do you have a specific gossip bug that you have seen recently
> which
> >>>>>>> caused a problem that would make this happen?  Do you have a
> >> specific JIRA
> >>>>>>> in mind?  “We can’t remove this because what if there is a bug”
> >> doesn’t
> >>>>>>> seem like a good enough reason to me. If that was a reason we would
> >> never
> >>>>>>> make any changes to anything.
> >>>>>>>> I think many people have seen PFS actually cause real problems,
> >> where
> >>>>>>> with GPFS the issue being talked about is predicated on some
> >> theoretical
> >>>>>>> gossip bug happening.
> >>>>>>>> In the past year at DataStax we have done a lot of testing on 3.0
> >> and
> >>>>>>> 3.11 around adding nodes, adding DC’s, replacing nodes, replacing
> >> racks,
> >>>>>>> and replacing DC’s, all while using GPFS, and as far as I know we
> >> have not
> >>>>>>> seen any “lost” rack/DC information during such testing.
> >>>>>>>>
> >>>>>>>> -Jeremiah
> >>>>>>>>
> >>>>>>>>> On Oct 22, 2018, at 5:46 PM, sankalp kohli <
> kohlisankalp@gmail.com
> >>>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> We will have similar issues with Gossip but this will create more
> >>>>>>> issues as
> >>>>>>>>> more things will be relied on Gossip.
> >>>>>>>>>
> >>>>>>>>> I agree PFS should be removed but I dont see how it can be with
> >> issues
> >>>>>>> like
> >>>>>>>>> these or someone proves that it wont cause any issues.
> >>>>>>>>>
> >>>>>>>>> On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <
> >> pauloricardomg@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I can understand keeping PFS for historical/compatibility
> >> reasons, but
> >>>>>>> if
> >>>>>>>>>> gossip is broken I think you will have similar ring view
> problems
> >>>>>>> during
> >>>>>>>>>> replace/bootstrap that would still occur with the use of PFS
> >> (such as
> >>>>>>>>>> missing tokens, since those are propagated via gossip), so that
> >> doesn't
> >>>>>>>>>> seem like a strong reason to keep it around.
> >>>>>>>>>>
> >>>>>>>>>> With PFS it's pretty easy to shoot yourself in the foot if
> you're
> >> not
> >>>>>>>>>> careful enough to have identical files across nodes and updating
> >> it
> >>>>>>> when
> >>>>>>>>>> adding nodes/dcs, so it's seems to be less foolproof than other
> >>>>>>> snitches.
> >>>>>>>>>> While the rejection of verbs to invalid replicas on trunk could
> >> address
> >>>>>>>>>> concerns raised by Jeremy, this would only happen after the new
> >> node
> >>>>>>> joins
> >>>>>>>>>> the ring, so you would need to re-bootstrap the node and lose
> all
> >> the
> >>>>>>> work
> >>>>>>>>>> done in the original bootstrap.
> >>>>>>>>>>
> >>>>>>>>>> Perhaps one good reason to use PFS is the ability to easily
> >> package it
> >>>>>>>>>> across multiple nodes, as pointed out by Sean Durity on
> >> CASSANDRA-10745
> >>>>>>>>>> (which is also it's Achilles' heel). To keep this ability, we
> >> could
> >>>>>>> make
> >>>>>>>>>> GPFS compatible with the cassandra-topology.properties file, but
> >>>>>>> reading
> >>>>>>>>>> only the dc/rack info about the local node.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Oct 22, 2018 at 4:58 PM sankalp kohli <kohlisankalp@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Yes it will happen. I am worried that same way DC or rack info
> >> can go
> >>>>>>>>>>> missing.
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <
> >>>>>>> pauloricardomg@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>> the new host won’t learn about the host whose status is
> >> missing and
> >>>>>>>>>> the
> >>>>>>>>>>>> view of this host will be wrong.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Won't this happen even with PropertyFileSnitch as the token(s)
> >> for
> >>>>>>> this
> >>>>>>>>>>>> host will be missing from gossip/system.peers?
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sat, Oct 20, 2018 at 12:34 AM Sankalp Kohli <kohlisankalp@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Say you restarted all instances in the cluster and status for
> >> some
> >>>>>>>>>> host
> >>>>>>>>>>>>> goes missing. Now when you start a host replacement, the new
> >> host
> >>>>>>>>>> won’t
> >>>>>>>>>>>>> learn about the host whose status is missing and the view of
> >> this
> >>>>>>>>>> host
> >>>>>>>>>>>> will
> >>>>>>>>>>>>> be wrong.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> PS: I will be happy to be proved wrong as I can also start
> >> using
> >>>>>>>>>> Gossip
> >>>>>>>>>>>>> snitch :)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <
> >>>>>>>>>>> jeremy.hanna1234@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Do you mean to say that during host replacement there may be
> >> a time
> >>>>>>>>>>>> when
> >>>>>>>>>>>>> the old->new host isn’t fully propagated and therefore
> >> wouldn’t yet
> >>>>>>>>>> be
> >>>>>>>>>>> in
> >>>>>>>>>>>>> all system tables?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Oct 17, 2018, at 4:20 PM, sankalp kohli <
> >>>>>>>>>> kohlisankalp@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is not the case during host replacement correct?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
> >>>>>>>>>>>>>>> jeremiah.jordan@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> As long as we are correctly storing such things in the
> >> system
> >>>>>>>>>>> tables
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>> reading them out of the system tables when we do not have
> >> the
> >>>>>>>>>>>>> information
> >>>>>>>>>>>>>>>> from gossip yet, it should not be a problem. (As far as I
> >> know
> >>>>>>>>>> GPFS
> >>>>>>>>>>>>> does
> >>>>>>>>>>>>>>>> this, but I have not done extensive code diving or testing
> >> to
> >>>>>>>>>> make
> >>>>>>>>>>>>> sure all
> >>>>>>>>>>>>>>>> edge cases are covered there)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -Jeremiah
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <
> >>>>>>>>>>> kohlisankalp@gmail.com
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Will GossipingPropertyFileSnitch not be vulnerable to
> >> Gossip
> >>>>>>>>>> bugs
> >>>>>>>>>>>>> where
> >>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>> lose hostId or some other fields when we restart C* for
> >> large
> >>>>>>>>>>>>>>>>> clusters(~1000 instances)?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <
> >> jjirsa@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> We should, but the 4.0 features that log/reject verbs to
> >>>>>>>>>> invalid
> >>>>>>>>>>>>>>>> replicas
> >>>>>>>>>>>>>>>>>> solves a lot of the concerns here
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>> Jeff Jirsa
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <
> >>>>>>>>>>>>> jeremy.hanna1234@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> We have had PropertyFileSnitch for a long time even
> >> though
> >>>>>>>>>>>>>>>>>> GossipingPropertyFileSnitch is effectively a superset of
> >> what
> >>>>>>>>>> it
> >>>>>>>>>>>>> offers
> >>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> is much less error prone.  There are some unexpected
> >> behaviors
> >>>>>>>>>>> when
> >>>>>>>>>>>>>>>> things
> >>>>>>>>>>>>>>>>>> aren’t configured correctly with PFS.  For example, if
> you
> >>>>>>>>>>> replace
> >>>>>>>>>>>>>>>> nodes in
> >>>>>>>>>>>>>>>>>> one DC and add those nodes to that DCs property files
> and
> >> not
> >>>>>>>>>> the
> >>>>>>>>>>>>> other
> >>>>>>>>>>>>>>>> DCs
> >>>>>>>>>>>>>>>>>> property files - the resulting problems aren’t very
> >>>>>>>>>>> straightforward
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> troubleshoot.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> We could try to improve the resilience and fail fast
> >> error
> >>>>>>>>>>>> checking
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> error reporting of PFS, but honestly, why wouldn’t we
> >> deprecate
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>> remove
> >>>>>>>>>>>>>>>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t
> >> be
> >>>>>>>>>>>>> sufficient
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> replace it?
> >>>>>>>>>>>>>>>>>>>
> --
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Deprecating/removing PropertyFileSnitch?

Posted by Jeremy Hanna <je...@gmail.com>.

> On Oct 29, 2018, at 11:20 AM, Jeff Jirsa <jj...@gmail.com> wrote:
> 
> On Mon, Oct 29, 2018 at 8:35 AM Jeremy Hanna <je...@gmail.com>
> wrote:
> 
>> Re-reading this thread, it sounds like the issue is there are times when a
>> field may go missing in gossip and it hasn’t yet been tracked down.  As
>> Jeremiah says, can we get that into a Jira issue with any contextual
>> information (if there is any)?  However as he says, in theory fields going
>> missing from gossip shouldn’t cause problems for users of GPFS and I don’t
>> believe there have been issues raised in that regard for all of the
>> clusters out there (including Jeff’s comment about it in this thread).
>> Testing that more thoroughly could also be a dependent ticket of
>> deprecating/removing PFS.
>> 
>> 
> The problem with opening a JIRA now is that it'll look just like 13700 and
> the others before it - it'll read something like "status goes missing in
> large clusters" and the very next time we find a gossip bug, we'll mark it
> as fixed, and it may or may not be the only cause of that bug.

I’ve created a Jira ticket to thoroughly test GPFS under such conditions; CASSANDRA-10745 depends on it for completion.  See CASSANDRA-14856 <https://issues.apache.org/jira/browse/CASSANDRA-14856>
> 
> 
>> Separately, both Jeff and Sankalp were saying that the fallback was a
>> problem and there was a flurry of tickets back in 2016 that led to the
>> original ticket to deprecate the property file snitch.  However,
>> https://issues.apache.org/jira/browse/CASSANDRA-10745 <
>> https://issues.apache.org/jira/browse/CASSANDRA-10745> discusses what to
>> do when deprecating it.  Would people want the functionality between GPFS
>> completely separate from PFS or would people want a mode to emulate it
>> while using the code for GPFS underneath?
>> 
> 
> Actually, Jeff was guessing that the class of problems that would make you
> want to deprecate PFS is fallback from GPFS to PFS (because beyond that PFS
> is just stupid easy to use and I can't imagine it's causing a lot of
> problems for people who know they're using PFS - yes, if you don't update
> the file, things break, but that's precisely the guarantee of the snitch).

My apologies if I had misrepresented, but I’m glad I checked.

What I was originally saying is that PFS has these sharp edges to it - if you don’t sync the files for whatever reason, there are problems.  I saw a case recently where a team upgraded their machines in one DC and their addresses were new in that DC.  They updated the properties file in the DC where they upgraded machines but neglected to update the addresses in the other DC.  In that case, the nodes in the other DC saw nodes that they had no configuration for and assigned them the default configuration as per the file option, which was incorrect.  That caused some problems that were difficult to work around.  All of this could have been avoided had they been using GPFS instead.

So in order to not invite problems such as this for those new to the project, and because there will always be times when configuration mismatches result in this sort of behavior (even with https://issues.apache.org/jira/browse/CASSANDRA-12681), I was hoping to get consensus on deprecating/removing PFS.

> 
> 
>> 
>> 
>>> On Oct 22, 2018, at 10:33 PM, Jeremiah D Jordan <
>> jeremiah.jordan@gmail.com> wrote:
>>> 
>>> If you guys are still seeing the problem, would be good to have a JIRA
>> written up, as all the ones linked were fixed in 2017 and 2015.
>> CASSANDRA-13700 was found during our testing, and we haven’t seen any other
>> issues since fixing it.
>>> 
>>> -Jeremiah
>>> 
>>>> On Oct 22, 2018, at 10:12 PM, Sankalp Kohli <ko...@gmail.com>
>> wrote:
>>>> 
>>>> No worries...I mentioned the issue not the JIRA number
>>>> 
>>>>> On Oct 22, 2018, at 8:01 PM, Jeremiah D Jordan <je...@datastax.com>
>> wrote:
>>>>> 
>>>>> Sorry, maybe my spam filter got them or something, but I have never
>> seen a JIRA number mentioned in the thread before this one.  Just looked
>> back through again to make sure, and this is the first email I have with
>> one.
>>>>> 
>>>>> -Jeremiah
>>>>> 
>>>>>> On Oct 22, 2018, at 9:37 PM, sankalp kohli <ko...@gmail.com>
>> wrote:
>>>>>> 
>>>>>> Here are some of the JIRAs which are fixed but actually did not fix
>> the
>>>>>> issue. We have tried fixing this by several patches. May be it will be
>>>>>> fixed when Gossip is rewritten(CASSANDRA-12345). I should find or
>> create a
>>>>>> new JIRA as this issue still exists.
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-10366
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-10089 (related to it)
>>>>>> 
>>>>>> Also the quote you are using was written as a follow on email. I have
>>>>>> already said what the bug I was referring to.
>>>>>> 
>>>>>> "Say you restarted all instances in the cluster and status for some
>> host
>>>>>> goes missing. Now when you start a host replacement, the new host
>> won’t
>>>>>> learn about the host whose status is missing and the view of this
>> host will
>>>>>> be wrong."
>>>>>> 
>>>>>> - CASSANDRA-10366
>>>>>> 
>>>>>> 
>>>>>> On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli <kohlisankalp@gmail.com
>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> I will send the JIRAs of the bug which we thought we have fixed but
>> it
>>>>>>> still exists.
>>>>>>> 
>>>>>>> Have you done any correctness testing after doing all these
>> tests...have
>>>>>>> you done the tests for 1000 instance clusters?
>>>>>>> 
>>>>>>> It is great you have done these tests and I am hoping the gossiping
>> snitch
>>>>>>> is good. Also was there any Gossip bug fixed post 3.0? May be I am
>> seeing
>>>>>>> the bug which is fixed.
>>>>>>> 
>>>>>>>> On Oct 22, 2018, at 7:09 PM, J. D. Jordan <
>> jeremiah.jordan@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Do you have a specific gossip bug that you have seen recently which
>>>>>>> caused a problem that would make this happen?  Do you have a
>> specific JIRA
>>>>>>> in mind?  “We can’t remove this because what if there is a bug”
>> doesn’t
>>>>>>> seem like a good enough reason to me. If that was a reason we would
>> never
>>>>>>> make any changes to anything.
>>>>>>>> I think many people have seen PFS actually cause real problems,
>> where
>>>>>>> with GPFS the issue being talked about is predicated on some
>> theoretical
>>>>>>> gossip bug happening.
>>>>>>>> In the past year at DataStax we have done a lot of testing on 3.0
>> and
>>>>>>> 3.11 around adding nodes, adding DC’s, replacing nodes, replacing
>> racks,
>>>>>>> and replacing DC’s, all while using GPFS, and as far as I know we
>> have not
>>>>>>> seen any “lost” rack/DC information during such testing.
>>>>>>>> 
>>>>>>>> -Jeremiah
>>>>>>>> 
>>>>>>>>> On Oct 22, 2018, at 5:46 PM, sankalp kohli <kohlisankalp@gmail.com
>>> 
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> We will have similar issues with Gossip but this will create more
>>>>>>> issues as
>>>>>>>>> more things will be relied on Gossip.
>>>>>>>>> 
>>>>>>>>> I agree PFS should be removed but I dont see how it can be with
>> issues
>>>>>>> like
>>>>>>>>> these or someone proves that it wont cause any issues.
>>>>>>>>> 
>>>>>>>>> On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <
>> pauloricardomg@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I can understand keeping PFS for historical/compatibility
>> reasons, but
>>>>>>> if
>>>>>>>>>> gossip is broken I think you will have similar ring view problems
>>>>>>> during
>>>>>>>>>> replace/bootstrap that would still occur with the use of PFS
>> (such as
>>>>>>>>>> missing tokens, since those are propagated via gossip), so that
>> doesn't
>>>>>>>>>> seem like a strong reason to keep it around.
>>>>>>>>>> 
>>>>>>>>>> With PFS it's pretty easy to shoot yourself in the foot if you're
>> not
>>>>>>>>>> careful enough to have identical files across nodes and updating
>> it
>>>>>>> when
>>>>>>>>>> adding nodes/dcs, so it's seems to be less foolproof than other
>>>>>>> snitches.
>>>>>>>>>> While the rejection of verbs to invalid replicas on trunk could
>> address
>>>>>>>>>> concerns raised by Jeremy, this would only happen after the new
>> node
>>>>>>> joins
>>>>>>>>>> the ring, so you would need to re-bootstrap the node and lose all
>> the
>>>>>>> work
>>>>>>>>>> done in the original bootstrap.
>>>>>>>>>> 
>>>>>>>>>> Perhaps one good reason to use PFS is the ability to easily
>> package it
>>>>>>>>>> across multiple nodes, as pointed out by Sean Durity on
>> CASSANDRA-10745
>>>>>>>>>> (which is also it's Achilles' heel). To keep this ability, we
>> could
>>>>>>> make
>>>>>>>>>> GPFS compatible with the cassandra-topology.properties file, but
>>>>>>> reading
>>>>>>>>>> only the dc/rack info about the local node.
>>>>>>>>>> 
>>>>>>>>>> Em seg, 22 de out de 2018 às 16:58, sankalp kohli <
>>>>>>> kohlisankalp@gmail.com>
>>>>>>>>>> escreveu:
>>>>>>>>>> 
>>>>>>>>>>> Yes it will happen. I am worried that same way DC or rack info
>> can go
>>>>>>>>>>> missing.
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <
>>>>>>> pauloricardomg@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>>> the new host won’t learn about the host whose status is
>> missing and
>>>>>>>>>> the
>>>>>>>>>>>> view of this host will be wrong.
>>>>>>>>>>>> 
>>>>>>>>>>>> Won't this happen even with PropertyFileSnitch as the token(s)
>> for
>>>>>>> this
>>>>>>>>>>>> host will be missing from gossip/system.peers?
>>>>>>>>>>>> 
>>>>>>>>>>>> Em sáb, 20 de out de 2018 às 00:34, Sankalp Kohli <
>>>>>>>>>>> kohlisankalp@gmail.com>
>>>>>>>>>>>> escreveu:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Say you restarted all instances in the cluster and status for
>> some
>>>>>>>>>> host
>>>>>>>>>>>>> goes missing. Now when you start a host replacement, the new
>> host
>>>>>>>>>> won’t
>>>>>>>>>>>>> learn about the host whose status is missing and the view of
>> this
>>>>>>>>>> host
>>>>>>>>>>>> will
>>>>>>>>>>>>> be wrong.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PS: I will be happy to be proved wrong as I can also start
>> using
>>>>>>>>>> Gossip
>>>>>>>>>>>>> snitch :)
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <
>>>>>>>>>>> jeremy.hanna1234@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Do you mean to say that during host replacement there may be
>> a time
>>>>>>>>>>>> when
>>>>>>>>>>>>> the old->new host isn’t fully propagated and therefore
>> wouldn’t yet
>>>>>>>>>> be
>>>>>>>>>>> in
>>>>>>>>>>>>> all system tables?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Oct 17, 2018, at 4:20 PM, sankalp kohli <
>>>>>>>>>> kohlisankalp@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This is not the case during host replacement correct?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
>>>>>>>>>>>>>>> jeremiah.jordan@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> As long as we are correctly storing such things in the
>> system
>>>>>>>>>>> tables
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> reading them out of the system tables when we do not have
>> the
>>>>>>>>>>>>> information
>>>>>>>>>>>>>>>> from gossip yet, it should not be a problem. (As far as I
>> know
>>>>>>>>>> GPFS
>>>>>>>>>>>>> does
>>>>>>>>>>>>>>>> this, but I have not done extensive code diving or testing
>> to
>>>>>>>>>> make
>>>>>>>>>>>>> sure all
>>>>>>>>>>>>>>>> edge cases are covered there)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -Jeremiah
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <
>>>>>>>>>>> kohlisankalp@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Will GossipingPropertyFileSnitch not be vulnerable to
>> Gossip
>>>>>>>>>> bugs
>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> lose hostId or some other fields when we restart C* for
>> large
>>>>>>>>>>>>>>>>> clusters(~1000 instances)?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <
>> jjirsa@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We should, but the 4.0 features that log/reject verbs to
>>>>>>>>>> invalid
>>>>>>>>>>>>>>>> replicas
>>>>>>>>>>>>>>>>>> solves a lot of the concerns here
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Jeff Jirsa
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <
>>>>>>>>>>>>> jeremy.hanna1234@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We have had PropertyFileSnitch for a long time even
>> though
>>>>>>>>>>>>>>>>>> GossipingPropertyFileSnitch is effectively a superset of
>> what
>>>>>>>>>> it
>>>>>>>>>>>>> offers
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> is much less error prone.  There are some unexpected
>> behaviors
>>>>>>>>>>> when
>>>>>>>>>>>>>>>> things
>>>>>>>>>>>>>>>>>> aren’t configured correctly with PFS.  For example, if you
>>>>>>>>>>> replace
>>>>>>>>>>>>>>>> nodes in
>>>>>>>>>>>>>>>>>> one DC and add those nodes to that DCs property files and
>> not
>>>>>>>>>> the
>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>> DCs
>>>>>>>>>>>>>>>>>> property files - the resulting problems aren’t very
>>>>>>>>>>> straightforward
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> troubleshoot.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We could try to improve the resilience and fail fast
>> error
>>>>>>>>>>>> checking
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> error reporting of PFS, but honestly, why wouldn’t we
>> deprecate
>>>>>>>>>>> and
>>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t
>> be
>>>>>>>>>>>>> sufficient
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> replace it?
>>>>>>>>>>>>>>>>>>> 


Re: Deprecating/removing PropertyFileSnitch?

Posted by "J. D. Jordan" <je...@gmail.com>.
The place people get in trouble with PFS is that the example file has a “default” setting in it, which people fill out because it is there. Later down the road they make a typo or otherwise mess up the file when they add nodes in a different DC than the default, those nodes quietly pick up the default DC/rack, and oops, stuff is messed up.  That, and the GPFS fallback.

So can we all agree to (1) rename the PFS example file so that someone has to copy/rename it to make it valid (to fix the GPFS fallback issues), and (2) remove the “default” rack/dc entry from the example file?  If we did those two things I think it would go a long way towards fixing PFS issues.

-Jeremiah
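
To make the failure mode above concrete, here is a minimal sketch of the kind of check an operator could run before rolling out a topology file. It assumes the usual `<ip>=<dc>:<rack>` line format of cassandra-topology.properties; the function names `parse_topology` and `audit_topology` and the node labels are hypothetical and are not part of Cassandra.

```python
def parse_topology(text):
    """Parse PropertyFileSnitch-style lines ('<ip>=<dc>:<rack>') into a dict."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        dc, _, rack = value.partition(":")
        entries[key.strip()] = (dc.strip(), rack.strip())
    return entries


def audit_topology(files):
    """files maps a node label to that node's file text; returns warnings."""
    parsed = {node: parse_topology(text) for node, text in files.items()}
    warnings = []
    # A 'default' entry hides missing nodes instead of failing loudly.
    for node, entries in sorted(parsed.items()):
        if "default" in entries:
            warnings.append(f"{node}: has a 'default' entry {entries['default']}")
    # Every node's copy of the file should agree on every endpoint.
    all_keys = set().union(*(e.keys() for e in parsed.values())) - {"default"}
    for key in sorted(all_keys):
        seen = {node: entries.get(key) for node, entries in parsed.items()}
        if len(set(seen.values())) > 1:
            warnings.append(f"{key}: inconsistent across nodes {sorted(seen.items())}")
    return warnings
```

Calling `audit_topology` with one file text per node returns an empty list when all copies agree and none defines `default`. Anything else is worth looking at before a restart.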

> On Oct 29, 2018, at 11:20 AM, Jeff Jirsa <jj...@gmail.com> wrote:
> 
> On Mon, Oct 29, 2018 at 8:35 AM Jeremy Hanna <je...@gmail.com>
> wrote:
> 
>> Re-reading this thread, it sounds like the issue is there are times when a
>> field may go missing in gossip and it hasn’t yet been tracked down.  As
>> Jeremiah says, can we get that into a Jira issue with any contextual
>> information (if there is any)?  However as he says, in theory fields going
>> missing from gossip shouldn’t cause problems for users of GPFS and I don’t
>> believe there have been issues raised in that regard for all of the
>> clusters out there (including Jeff’s comment about it in this thread).
>> Testing that more thoroughly could also be a dependent ticket of
>> deprecating/removing PFS.
>> 
>> 
> The problem with opening a JIRA now is that it'll look just like 13700 and
> the others before it - it'll read something like "status goes missing in
> large clusters" and the very next time we find a gossip bug, we'll mark it
> as fixed, and it may or may not be the only cause of that bug.
> 
> 
> 
>> Separately, both Jeff and Sankalp were saying that the fallback was a
>> problem and there was a flurry of tickets back in 2016 that led to the
>> original ticket to deprecate the property file snitch.  However,
>> https://issues.apache.org/jira/browse/CASSANDRA-10745 discusses what to
>> do when deprecating it.  Would people want the functionality between GPFS
>> completely separate from PFS or would people want a mode to emulate it
>> while using the code for GPFS underneath?
>> 
> 
> Actually, Jeff was guessing that the class of problems that would make you
> want to deprecate PFS is fallback from GPFS to PFS (because beyond that PFS
> is just stupid easy to use and I can't imagine it's causing a lot of
> problems for people who know they're using PFS - yes, if you don't update
> the file, things break, but that's precisely the guarantee of the snitch).
> 
> 
>> 
>> 
>>> On Oct 22, 2018, at 10:33 PM, Jeremiah D Jordan <
>> jeremiah.jordan@gmail.com> wrote:
>>> 
>>> If you guys are still seeing the problem, would be good to have a JIRA
>> written up, as all the ones linked were fixed in 2017 and 2015.
>> CASSANDRA-13700 was found during our testing, and we haven’t seen any other
>> issues since fixing it.
>>> 
>>> -Jeremiah
>>> 
>>>> On Oct 22, 2018, at 10:12 PM, Sankalp Kohli <ko...@gmail.com>
>> wrote:
>>>> 
>>>> No worries...I mentioned the issue not the JIRA number
>>>> 
>>>>> On Oct 22, 2018, at 8:01 PM, Jeremiah D Jordan <je...@datastax.com>
>> wrote:
>>>>> 
>>>>> Sorry, maybe my spam filter got them or something, but I have never
>> seen a JIRA number mentioned in the thread before this one.  Just looked
>> back through again to make sure, and this is the first email I have with
>> one.
>>>>> 
>>>>> -Jeremiah
>>>>> 
>>>>>> On Oct 22, 2018, at 9:37 PM, sankalp kohli <ko...@gmail.com>
>> wrote:
>>>>>> 
>>>>>> Here are some of the JIRAs which are fixed but actually did not fix
>> the
>>>>>> issue. We have tried fixing this by several patches. May be it will be
>>>>>> fixed when Gossip is rewritten(CASSANDRA-12345). I should find or
>> create a
>>>>>> new JIRA as this issue still exists.
>>>>>> 
>> https://issues.apache.org/jira/browse/CASSANDRA-10366
>>>>>> 
>> https://issues.apache.org/jira/browse/CASSANDRA-10089
>> (related to it)
>>>>>> 
>>>>>> Also the quote you are using was written as a follow on email. I have
>>>>>> already said what the bug I was referring to.
>>>>>> 
>>>>>> "Say you restarted all instances in the cluster and status for some
>> host
>>>>>> goes missing. Now when you start a host replacement, the new host
>> won’t
>>>>>> learn about the host whose status is missing and the view of this
>> host will
>>>>>> be wrong."
>>>>>> 
>>>>>> - CASSANDRA-10366
>>>>>> 
>>>>>> 
>>>>>> On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli <kohlisankalp@gmail.com
>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> I will send the JIRAs of the bug which we thought we have fixed but
>> it
>>>>>>> still exists.
>>>>>>> 
>>>>>>> Have you done any correctness testing after doing all these
>> tests...have
>>>>>>> you done the tests for 1000 instance clusters?
>>>>>>> 
>>>>>>> It is great you have done these tests and I am hoping the gossiping
>> snitch
>>>>>>> is good. Also was there any Gossip bug fixed post 3.0? May be I am
>> seeing
>>>>>>> the bug which is fixed.
>>>>>>> 
>>>>>>>> On Oct 22, 2018, at 7:09 PM, J. D. Jordan <
>> jeremiah.jordan@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Do you have a specific gossip bug that you have seen recently which
>>>>>>> caused a problem that would make this happen?  Do you have a
>> specific JIRA
>>>>>>> in mind?  “We can’t remove this because what if there is a bug”
>> doesn’t
>>>>>>> seem like a good enough reason to me. If that was a reason we would
>> never
>>>>>>> make any changes to anything.
>>>>>>>> I think many people have seen PFS actually cause real problems,
>> where
>>>>>>> with GPFS the issue being talked about is predicated on some
>> theoretical
>>>>>>> gossip bug happening.
>>>>>>>> In the past year at DataStax we have done a lot of testing on 3.0
>> and
>>>>>>> 3.11 around adding nodes, adding DC’s, replacing nodes, replacing
>> racks,
>>>>>>> and replacing DC’s, all while using GPFS, and as far as I know we
>> have not
>>>>>>> seen any “lost” rack/DC information during such testing.
>>>>>>>> 
>>>>>>>> -Jeremiah
>>>>>>>> 
>>>>>>>>> On Oct 22, 2018, at 5:46 PM, sankalp kohli <kohlisankalp@gmail.com
>>> 
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> We will have similar issues with Gossip but this will create more
>>>>>>> issues as
>>>>>>>>> more things will be relied on Gossip.
>>>>>>>>> 
>>>>>>>>> I agree PFS should be removed but I dont see how it can be with
>> issues
>>>>>>> like
>>>>>>>>> these or someone proves that it wont cause any issues.
>>>>>>>>> 
>>>>>>>>> On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <
>> pauloricardomg@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I can understand keeping PFS for historical/compatibility
>> reasons, but
>>>>>>> if
>>>>>>>>>> gossip is broken I think you will have similar ring view problems
>>>>>>> during
>>>>>>>>>> replace/bootstrap that would still occur with the use of PFS
>> (such as
>>>>>>>>>> missing tokens, since those are propagated via gossip), so that
>> doesn't
>>>>>>>>>> seem like a strong reason to keep it around.
>>>>>>>>>> 
>>>>>>>>>> With PFS it's pretty easy to shoot yourself in the foot if you're
>> not
>>>>>>>>>> careful enough to have identical files across nodes and updating
>> it
>>>>>>> when
>>>>>>>>>> adding nodes/dcs, so it's seems to be less foolproof than other
>>>>>>> snitches.
>>>>>>>>>> While the rejection of verbs to invalid replicas on trunk could
>> address
>>>>>>>>>> concerns raised by Jeremy, this would only happen after the new
>> node
>>>>>>> joins
>>>>>>>>>> the ring, so you would need to re-bootstrap the node and lose all
>> the
>>>>>>> work
>>>>>>>>>> done in the original bootstrap.
>>>>>>>>>> 
>>>>>>>>>> Perhaps one good reason to use PFS is the ability to easily
>> package it
>>>>>>>>>> across multiple nodes, as pointed out by Sean Durity on
>> CASSANDRA-10745
>>>>>>>>>> (which is also it's Achilles' heel). To keep this ability, we
>> could
>>>>>>> make
>>>>>>>>>> GPFS compatible with the cassandra-topology.properties file, but
>>>>>>> reading
>>>>>>>>>> only the dc/rack info about the local node.
>>>>>>>>>> 
>>>>>>>>>> Em seg, 22 de out de 2018 às 16:58, sankalp kohli <
>>>>>>> kohlisankalp@gmail.com>
>>>>>>>>>> escreveu:
>>>>>>>>>> 
>>>>>>>>>>> Yes it will happen. I am worried that same way DC or rack info
>> can go
>>>>>>>>>>> missing.
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <
>>>>>>> pauloricardomg@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>>> the new host won’t learn about the host whose status is
>> missing and
>>>>>>>>>> the
>>>>>>>>>>>> view of this host will be wrong.
>>>>>>>>>>>> 
>>>>>>>>>>>> Won't this happen even with PropertyFileSnitch as the token(s)
>> for
>>>>>>> this
>>>>>>>>>>>> host will be missing from gossip/system.peers?
>>>>>>>>>>>> 
>>>>>>>>>>>> Em sáb, 20 de out de 2018 às 00:34, Sankalp Kohli <
>>>>>>>>>>> kohlisankalp@gmail.com>
>>>>>>>>>>>> escreveu:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Say you restarted all instances in the cluster and status for
>> some
>>>>>>>>>> host
>>>>>>>>>>>>> goes missing. Now when you start a host replacement, the new
>> host
>>>>>>>>>> won’t
>>>>>>>>>>>>> learn about the host whose status is missing and the view of
>> this
>>>>>>>>>> host
>>>>>>>>>>>> will
>>>>>>>>>>>>> be wrong.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PS: I will be happy to be proved wrong as I can also start
>> using
>>>>>>>>>> Gossip
>>>>>>>>>>>>> snitch :)
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <
>>>>>>>>>>> jeremy.hanna1234@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Do you mean to say that during host replacement there may be
>> a time
>>>>>>>>>>>> when
>>>>>>>>>>>>> the old->new host isn’t fully propagated and therefore
>> wouldn’t yet
>>>>>>>>>> be
>>>>>>>>>>> in
>>>>>>>>>>>>> all system tables?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Oct 17, 2018, at 4:20 PM, sankalp kohli <
>>>>>>>>>> kohlisankalp@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This is not the case during host replacement correct?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
>>>>>>>>>>>>>>> jeremiah.jordan@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> As long as we are correctly storing such things in the
>> system
>>>>>>>>>>> tables
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> reading them out of the system tables when we do not have
>> the
>>>>>>>>>>>>> information
>>>>>>>>>>>>>>>> from gossip yet, it should not be a problem. (As far as I
>> know
>>>>>>>>>> GPFS
>>>>>>>>>>>>> does
>>>>>>>>>>>>>>>> this, but I have not done extensive code diving or testing
>> to
>>>>>>>>>> make
>>>>>>>>>>>>> sure all
>>>>>>>>>>>>>>>> edge cases are covered there)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -Jeremiah
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <
>>>>>>>>>>> kohlisankalp@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Will GossipingPropertyFileSnitch not be vulnerable to
>> Gossip
>>>>>>>>>> bugs
>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> lose hostId or some other fields when we restart C* for
>> large
>>>>>>>>>>>>>>>>> clusters(~1000 instances)?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <
>> jjirsa@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We should, but the 4.0 features that log/reject verbs to
>>>>>>>>>> invalid
>>>>>>>>>>>>>>>> replicas
>>>>>>>>>>>>>>>>>> solves a lot of the concerns here
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Jeff Jirsa
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <
>>>>>>>>>>>>> jeremy.hanna1234@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We have had PropertyFileSnitch for a long time even
>> though
>>>>>>>>>>>>>>>>>> GossipingPropertyFileSnitch is effectively a superset of
>> what
>>>>>>>>>> it
>>>>>>>>>>>>> offers
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> is much less error prone.  There are some unexpected
>> behaviors
>>>>>>>>>>> when
>>>>>>>>>>>>>>>> things
>>>>>>>>>>>>>>>>>> aren’t configured correctly with PFS.  For example, if you
>>>>>>>>>>> replace
>>>>>>>>>>>>>>>> nodes in
>>>>>>>>>>>>>>>>>> one DC and add those nodes to that DCs property files and
>> not
>>>>>>>>>> the
>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>> DCs
>>>>>>>>>>>>>>>>>> property files - the resulting problems aren’t very
>>>>>>>>>>> straightforward
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> troubleshoot.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We could try to improve the resilience and fail fast
>> error
>>>>>>>>>>>> checking
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> error reporting of PFS, but honestly, why wouldn’t we
>> deprecate
>>>>>>>>>>> and
>>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t
>> be
>>>>>>>>>>>>> sufficient
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> replace it?
>>>>>>>>>>>>>>>>>>> 



Re: Deprecating/removing PropertyFileSnitch?

Posted by Jeff Jirsa <jj...@gmail.com>.
On Mon, Oct 29, 2018 at 8:35 AM Jeremy Hanna <je...@gmail.com>
wrote:

> Re-reading this thread, it sounds like the issue is there are times when a
> field may go missing in gossip and it hasn’t yet been tracked down.  As
> Jeremiah says, can we get that into a Jira issue with any contextual
> information (if there is any)?  However as he says, in theory fields going
> missing from gossip shouldn’t cause problems for users of GPFS and I don’t
> believe there have been issues raised in that regard for all of the
> clusters out there (including Jeff’s comment about it in this thread).
> Testing that more thoroughly could also be a dependent ticket of
> deprecating/removing PFS.
>
>
The problem with opening a JIRA now is that it'll look just like 13700 and
the others before it - it'll read something like "status goes missing in
large clusters" and the very next time we find a gossip bug, we'll mark it
as fixed, and it may or may not be the only cause of that bug.



> Separately, both Jeff and Sankalp were saying that the fallback was a
> problem and there was a flurry of tickets back in 2016 that led to the
> original ticket to deprecate the property file snitch.  However,
> https://issues.apache.org/jira/browse/CASSANDRA-10745 discusses what to
> do when deprecating it.  Would people want the functionality between GPFS
> completely separate from PFS or would people want a mode to emulate it
> while using the code for GPFS underneath?
>

Actually, Jeff was guessing that the class of problems that would make you
want to deprecate PFS is fallback from GPFS to PFS (because beyond that PFS
is just stupid easy to use and I can't imagine it's causing a lot of
problems for people who know they're using PFS - yes, if you don't update
the file, things break, but that's precisely the guarantee of the snitch).
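
For anyone following along, the fallback concern can be sketched as a toy lookup. This is a simplified model, not the actual snitch code: the ordering (gossip first, then a PFS file if one is present, otherwise the values saved in the system tables) is an assumption to verify against the source, and `resolve_dc`, its arguments, and the `DEFAULT_DC` placeholder are all hypothetical names.

```python
DEFAULT_DC = "UNKNOWN_DC"  # placeholder value for this sketch


def resolve_dc(endpoint, gossip_dc, saved_dc, pfs_dc=None):
    """Toy model of a GPFS-style DC lookup with an optional PFS fallback.

    gossip_dc: DC values currently known from gossip
    saved_dc:  DC values previously persisted (e.g. in system tables)
    pfs_dc:    parsed topology file, or None when no such file is on disk
    """
    if endpoint in gossip_dc:
        return gossip_dc[endpoint]
    if pfs_dc is not None:
        # A leftover topology file wins over the saved values here, and its
        # 'default' entry answers for every endpoint it does not list.
        return pfs_dc.get(endpoint, pfs_dc.get("default", DEFAULT_DC))
    return saved_dc.get(endpoint, DEFAULT_DC)
```

In this model, a stale file containing only `default=DC1` would report DC1 for an endpoint that the saved state places in DC2. That is the sort of surprise that comes from the fallback rather than from deliberate PFS use.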


>
>
> > On Oct 22, 2018, at 10:33 PM, Jeremiah D Jordan <
> jeremiah.jordan@gmail.com> wrote:
> >
> > If you guys are still seeing the problem, would be good to have a JIRA
> written up, as all the ones linked were fixed in 2017 and 2015.
> CASSANDRA-13700 was found during our testing, and we haven’t seen any other
> issues since fixing it.
> >
> > -Jeremiah
> >
> >> On Oct 22, 2018, at 10:12 PM, Sankalp Kohli <ko...@gmail.com>
> wrote:
> >>
> >> No worries...I mentioned the issue not the JIRA number
> >>
> >>> On Oct 22, 2018, at 8:01 PM, Jeremiah D Jordan <je...@datastax.com>
> wrote:
> >>>
> >>> Sorry, maybe my spam filter got them or something, but I have never
> seen a JIRA number mentioned in the thread before this one.  Just looked
> back through again to make sure, and this is the first email I have with
> one.
> >>>
> >>> -Jeremiah
> >>>
> >>>> On Oct 22, 2018, at 9:37 PM, sankalp kohli <ko...@gmail.com>
> wrote:
> >>>>
> >>>> Here are some of the JIRAs which are fixed but actually did not fix
> the
> >>>> issue. We have tried fixing this by several patches. May be it will be
> >>>> fixed when Gossip is rewritten(CASSANDRA-12345). I should find or
> create a
> >>>> new JIRA as this issue still exists.
> >>>>
> https://issues.apache.org/jira/browse/CASSANDRA-10366
> >>>>
> https://issues.apache.org/jira/browse/CASSANDRA-10089
> (related to it)
> >>>>
> >>>> Also the quote you are using was written as a follow on email. I have
> >>>> already said what the bug I was referring to.
> >>>>
> >>>> "Say you restarted all instances in the cluster and status for some
> host
> >>>> goes missing. Now when you start a host replacement, the new host
> won’t
> >>>> learn about the host whose status is missing and the view of this
> host will
> >>>> be wrong."
> >>>>
> >>>> - CASSANDRA-10366
> >>>>
> >>>>
> >>>> On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli <kohlisankalp@gmail.com
> >
> >>>> wrote:
> >>>>
> >>>>> I will send the JIRAs of the bug which we thought we have fixed but
> it
> >>>>> still exists.
> >>>>>
> >>>>> Have you done any correctness testing after doing all these
> tests...have
> >>>>> you done the tests for 1000 instance clusters?
> >>>>>
> >>>>> It is great you have done these tests and I am hoping the gossiping
> snitch
> >>>>> is good. Also was there any Gossip bug fixed post 3.0? May be I am
> seeing
> >>>>> the bug which is fixed.
> >>>>>
> >>>>>> On Oct 22, 2018, at 7:09 PM, J. D. Jordan <
> jeremiah.jordan@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Do you have a specific gossip bug that you have seen recently which
> >>>>> caused a problem that would make this happen?  Do you have a
> specific JIRA
> >>>>> in mind?  “We can’t remove this because what if there is a bug”
> doesn’t
> >>>>> seem like a good enough reason to me. If that was a reason we would
> never
> >>>>> make any changes to anything.
> >>>>>> I think many people have seen PFS actually cause real problems,
> where
> >>>>> with GPFS the issue being talked about is predicated on some
> theoretical
> >>>>> gossip bug happening.
> >>>>>> In the past year at DataStax we have done a lot of testing on 3.0
> and
> >>>>> 3.11 around adding nodes, adding DC’s, replacing nodes, replacing
> racks,
> >>>>> and replacing DC’s, all while using GPFS, and as far as I know we
> have not
> >>>>> seen any “lost” rack/DC information during such testing.

Re: Deprecating/removing PropertyFileSnitch?

Posted by Jeremy Hanna <je...@gmail.com>.
Re-reading this thread, it sounds like the issue is that a field may occasionally go missing from gossip, and the cause hasn’t yet been tracked down.  As Jeremiah says, can we get that into a Jira issue with any contextual information (if there is any)?  However, as he also says, in theory fields going missing from gossip shouldn’t cause problems for users of GPFS, and I don’t believe any issues have been raised in that regard across all of the clusters out there (see also Jeff’s comment about it in this thread).  Testing that more thoroughly could also be a ticket that deprecating/removing PFS depends on.
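
To make the "in theory" part concrete, here is a minimal Python sketch of the lookup order Jeremiah described earlier in the thread: use what gossip reports, and fall back to what was previously saved in the system tables when gossip has nothing (or only partial data) for a host. This is an illustration only, not Cassandra's code; the function name, the dict-based inputs, and the UNKNOWN_* defaults are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical placeholder values used when nothing is known about a host.
DEFAULT_DC = "UNKNOWN_DC"
DEFAULT_RACK = "UNKNOWN_RACK"

Location = Tuple[Optional[str], Optional[str]]


def resolve_location(
    endpoint: str,
    local_endpoint: str,
    local_location: Tuple[str, str],
    gossip_state: Dict[str, Location],
    saved_peers: Dict[str, Tuple[str, str]],
) -> Tuple[str, str]:
    """Return (dc, rack) for endpoint, preferring gossip over saved state."""
    # The local node always knows its own dc/rack from its own config file.
    if endpoint == local_endpoint:
        return local_location

    # Use gossip only when both fields are present for this endpoint.
    live = gossip_state.get(endpoint)
    if live is not None and live[0] and live[1]:
        return (live[0], live[1])

    # Gossip has nothing usable yet: fall back to what was persisted earlier
    # (standing in for the system tables in this sketch).
    saved = saved_peers.get(endpoint)
    if saved is not None:
        return saved

    # Nothing known at all.
    return (DEFAULT_DC, DEFAULT_RACK)
```

With this ordering, a field that goes missing from gossip only matters for a host that has also never been persisted locally, which is the case that would need the extra testing.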

Separately, both Jeff and Sankalp were saying that the fallback was a problem, and there was a flurry of tickets back in 2016 that led to the original ticket to deprecate the property file snitch.  However, https://issues.apache.org/jira/browse/CASSANDRA-10745 discusses what to do when deprecating it.  Would people want GPFS to be completely separate from PFS, or would people want a mode that emulates PFS while using the GPFS code underneath?
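
One way to picture the emulation option, along the lines of Paulo's suggestion of reading only the local node's dc/rack from cassandra-topology.properties, is the rough Python sketch below. It assumes the file uses address=DC:RACK lines with an optional default=DC:RACK entry; the function names are hypothetical, and backslash-escaped IPv6 addresses are not handled.

```python
from typing import Dict, Tuple


def parse_topology(text: str) -> Dict[str, Tuple[str, str]]:
    """Parse address=DC:RACK lines, skipping blanks, comments and bad lines."""
    entries: Dict[str, Tuple[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        dc, colon, rack = value.strip().partition(":")
        if not colon:
            continue  # value is not of the form DC:RACK
        entries[key.strip()] = (dc.strip(), rack.strip())
    return entries


def local_dc_rack(text: str, local_address: str) -> Tuple[str, str]:
    """Return only the local node's (dc, rack); other entries are ignored."""
    entries = parse_topology(text)
    location = entries.get(local_address, entries.get("default"))
    if location is None:
        raise ValueError(f"no entry or default for {local_address}")
    return location
```

In this sketch, entries for other nodes are never consulted, so a stale or mismatched line for a remote node in one DC's copy of the file could not affect the local node's view; the remote dc/rack would come from gossip, as with GPFS today.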


> On Oct 22, 2018, at 10:33 PM, Jeremiah D Jordan <je...@gmail.com> wrote:
> 
> If you guys are still seeing the problem, would be good to have a JIRA written up, as all the ones linked were fixed in 2017 and 2015.  CASSANDRA-13700 was found during our testing, and we haven’t seen any other issues since fixing it.
> 
> -Jeremiah
> 
>> On Oct 22, 2018, at 10:12 PM, Sankalp Kohli <ko...@gmail.com> wrote:
>> 
>> No worries...I mentioned the issue not the JIRA number 
>> 
>>> On Oct 22, 2018, at 8:01 PM, Jeremiah D Jordan <je...@datastax.com> wrote:
>>> 
>>> Sorry, maybe my spam filter got them or something, but I have never seen a JIRA number mentioned in the thread before this one.  Just looked back through again to make sure, and this is the first email I have with one.
>>> 
>>> -Jeremiah
>>> 
>>>> On Oct 22, 2018, at 9:37 PM, sankalp kohli <ko...@gmail.com> wrote:
>>>> 
>>>> Here are some of the JIRAs that were marked as fixed but did not actually
>>>> fix the issue. We have tried fixing this with several patches. Maybe it will
>>>> be fixed when Gossip is rewritten (CASSANDRA-12345). I should find or create
>>>> a new JIRA, as this issue still exists.
>>>> https://issues.apache.org/jira/browse/CASSANDRA-10366
>>>> https://issues.apache.org/jira/browse/CASSANDRA-10089 (related to it)
>>>> 
>>>> Also, the quote you are using was written in a follow-on email. I had
>>>> already described the bug I was referring to:
>>>> 
>>>> "Say you restarted all instances in the cluster and status for some host
>>>> goes missing. Now when you start a host replacement, the new host won’t
>>>> learn about the host whose status is missing and the view of this host will
>>>> be wrong."
>>>> 
>>>> - CASSANDRA-10366
>>>> 
>>>> 
>>>> On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli <ko...@gmail.com>
>>>> wrote:
>>>> 
>>>>> I will send the JIRAs for the bug which we thought we had fixed but which
>>>>> still exists.
>>>>>
>>>>> Have you done any correctness testing after doing all these tests? Have
>>>>> you run the tests on 1000-instance clusters?
>>>>>
>>>>> It is great you have done these tests and I am hoping the gossiping snitch
>>>>> is good. Also, was there any Gossip bug fixed post 3.0? Maybe I am seeing
>>>>> a bug which is already fixed.
>>>>> 
>>>>>> On Oct 22, 2018, at 7:09 PM, J. D. Jordan <je...@gmail.com> wrote:
>>>>>>
>>>>>> Do you have a specific gossip bug that you have seen recently which
>>>>>> caused a problem that would make this happen?  Do you have a specific JIRA
>>>>>> in mind?  “We can’t remove this because what if there is a bug” doesn’t
>>>>>> seem like a good enough reason to me. If that was a reason we would never
>>>>>> make any changes to anything.
>>>>>> I think many people have seen PFS actually cause real problems, where
>>>>>> with GPFS the issue being talked about is predicated on some theoretical
>>>>>> gossip bug happening.
>>>>>> In the past year at DataStax we have done a lot of testing on 3.0 and
>>>>>> 3.11 around adding nodes, adding DC’s, replacing nodes, replacing racks,
>>>>>> and replacing DC’s, all while using GPFS, and as far as I know we have not
>>>>>> seen any “lost” rack/DC information during such testing.
>>>>>> 
>>>>>> -Jeremiah
>>>>>> 
>>>>>>> On Oct 22, 2018, at 5:46 PM, sankalp kohli <ko...@gmail.com> wrote:
>>>>>>>
>>>>>>> We will have similar issues with Gossip, but this will create more issues
>>>>>>> as more things will rely on Gossip.
>>>>>>>
>>>>>>> I agree PFS should be removed, but I don't see how it can be with issues
>>>>>>> like these, unless someone proves that it won't cause any issues.
>>>>>>> 
>>>>>>> On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <pa...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> I can understand keeping PFS for historical/compatibility reasons, but if
>>>>>>>> gossip is broken I think you will have similar ring view problems during
>>>>>>>> replace/bootstrap that would still occur with the use of PFS (such as
>>>>>>>> missing tokens, since those are propagated via gossip), so that doesn't
>>>>>>>> seem like a strong reason to keep it around.
>>>>>>>>
>>>>>>>> With PFS it's pretty easy to shoot yourself in the foot if you're not
>>>>>>>> careful to keep identical files across nodes and to update them when
>>>>>>>> adding nodes/DCs, so it seems to be less foolproof than other snitches.
>>>>>>>> While the rejection of verbs to invalid replicas on trunk could address
>>>>>>>> the concerns raised by Jeremy, this would only happen after the new node
>>>>>>>> joins the ring, so you would need to re-bootstrap the node and lose all
>>>>>>>> the work done in the original bootstrap.
>>>>>>>>
>>>>>>>> Perhaps one good reason to use PFS is the ability to easily package it
>>>>>>>> across multiple nodes, as pointed out by Sean Durity on CASSANDRA-10745
>>>>>>>> (which is also its Achilles' heel). To keep this ability, we could make
>>>>>>>> GPFS compatible with the cassandra-topology.properties file, but reading
>>>>>>>> only the dc/rack info about the local node.
>>>>>>>> 
>>>>>>>> On Mon, Oct 22, 2018 at 4:58 PM sankalp kohli <kohlisankalp@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, it will happen. I am worried that DC or rack info can go missing
>>>>>>>>> in the same way.
>>>>>>>>> 
>>>>>>>>> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <pauloricardomg@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>> the new host won’t learn about the host whose status is missing and
>>>>>>>>>>> the view of this host will be wrong.
>>>>>>>>>>
>>>>>>>>>> Won't this happen even with PropertyFileSnitch, as the token(s) for this
>>>>>>>>>> host will be missing from gossip/system.peers?
>>>>>>>>>> 
>>>>>>>>>> On Sat, Oct 20, 2018 at 12:34 AM Sankalp Kohli <kohlisankalp@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Say you restarted all instances in the cluster and status for some host
>>>>>>>>>>> goes missing. Now when you start a host replacement, the new host won’t
>>>>>>>>>>> learn about the host whose status is missing and the view of this host
>>>>>>>>>>> will be wrong.
>>>>>>>>>>>
>>>>>>>>>>> PS: I will be happy to be proved wrong as I can also start using Gossip
>>>>>>>>>>> snitch :)
>>>>>>>>>>> 
>>>>>>>>>>>> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <jeremy.hanna1234@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Do you mean to say that during host replacement there may be a time when
>>>>>>>>>>>> the old->new host isn’t fully propagated and therefore wouldn’t yet be in
>>>>>>>>>>>> all system tables?
>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 17, 2018, at 4:20 PM, sankalp kohli <kohlisankalp@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is not the case during host replacement, correct?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
>>>>>>>>>>>>> jeremiah.jordan@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As long as we are correctly storing such things in the system tables and
>>>>>>>>>>>>>> reading them out of the system tables when we do not have the information
>>>>>>>>>>>>>> from gossip yet, it should not be a problem. (As far as I know GPFS does
>>>>>>>>>>>>>> this, but I have not done extensive code diving or testing to make sure
>>>>>>>>>>>>>> all edge cases are covered there.)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -Jeremiah
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <kohlisankalp@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Will GossipingPropertyFileSnitch not be vulnerable to Gossip bugs where
>>>>>>>>>>>>>>> we lose hostId or some other fields when we restart C* for large
>>>>>>>>>>>>>>> clusters (~1000 instances)?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jj...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We should, but the 4.0 features that log/reject verbs to invalid
>>>>>>>>>>>>>>>> replicas solve a lot of the concerns here
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jeff Jirsa
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <jeremy.hanna1234@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We have had PropertyFileSnitch for a long time even though
>>>>>>>>>>>>>>>>> GossipingPropertyFileSnitch is effectively a superset of what it offers
>>>>>>>>>>>>>>>>> and is much less error prone.  There are some unexpected behaviors when
>>>>>>>>>>>>>>>>> things aren’t configured correctly with PFS.  For example, if you replace
>>>>>>>>>>>>>>>>> nodes in one DC and add those nodes to that DCs property files and not the
>>>>>>>>>>>>>>>>> other DCs property files - the resulting problems aren’t very
>>>>>>>>>>>>>>>>> straightforward to troubleshoot.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We could try to improve the resilience and fail fast error checking and
>>>>>>>>>>>>>>>>> error reporting of PFS, but honestly, why wouldn’t we deprecate and remove
>>>>>>>>>>>>>>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t be sufficient to
>>>>>>>>>>>>>>>>> replace it?


Re: Deprecating/removing PropertyFileSnitch?

Posted by Jeremiah D Jordan <je...@gmail.com>.
If you guys are still seeing the problem, would be good to have a JIRA written up, as all the ones linked were fixed in 2017 and 2015.  CASSANDRA-13700 was found during our testing, and we haven’t seen any other issues since fixing it.

-Jeremiah

> On Oct 22, 2018, at 10:12 PM, Sankalp Kohli <ko...@gmail.com> wrote:
> 
> No worries...I mentioned the issue not the JIRA number 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Re: Deprecating/removing PropertyFileSnitch?

Posted by Sankalp Kohli <ko...@gmail.com>.
No worries... I mentioned the issue, not the JIRA number.

> On Oct 22, 2018, at 8:01 PM, Jeremiah D Jordan <je...@datastax.com> wrote:
> 
> Sorry, maybe my spam filter got them or something, but I have never seen a JIRA number mentioned in the thread before this one.  Just looked back through again to make sure, and this is the first email I have with one.
> 
> -Jeremiah


Re: Deprecating/removing PropertyFileSnitch?

Posted by Jeremiah D Jordan <je...@datastax.com>.
Sorry, maybe my spam filter got them or something, but I have never seen a JIRA number mentioned in the thread before this one.  Just looked back through again to make sure, and this is the first email I have with one.

-Jeremiah

> On Oct 22, 2018, at 9:37 PM, sankalp kohli <ko...@gmail.com> wrote:
> 
> Here are some of the JIRAs which are marked fixed but did not actually fix
> the issue. We have tried fixing this with several patches. Maybe it will be
> fixed when Gossip is rewritten (CASSANDRA-12345). I should find or create a
> new JIRA, as this issue still exists.
> https://issues.apache.org/jira/browse/CASSANDRA-10366
> https://issues.apache.org/jira/browse/CASSANDRA-10089 (related to it)
> 
> Also, the quote you are using was written in a follow-on email. I have
> already said which bug I was referring to.
> 
> "Say you restarted all instances in the cluster and status for some host
> goes missing. Now when you start a host replacement, the new host won’t
> learn about the host whose status is missing and the view of this host will
> be wrong."
> 
>   - CASSANDRA-10366


Re: Deprecating/removing PropertyFileSnitch?

Posted by sankalp kohli <ko...@gmail.com>.
Here are some of the JIRAs which are marked fixed but did not actually fix
the issue. We have tried fixing this with several patches. Maybe it will be
fixed when Gossip is rewritten (CASSANDRA-12345). I should find or create a
new JIRA, as this issue still exists.
https://issues.apache.org/jira/browse/CASSANDRA-10366
https://issues.apache.org/jira/browse/CASSANDRA-10089 (related to it)

Also, the quote you are using was written in a follow-on email. I have
already said which bug I was referring to.

"Say you restarted all instances in the cluster and status for some host
goes missing. Now when you start a host replacement, the new host won’t
learn about the host whose status is missing and the view of this host will
be wrong."

   - CASSANDRA-10366


On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli <ko...@gmail.com>
wrote:

> I will send the JIRAs for the bug which we thought we had fixed but which
> still exists.
>
> Have you done any correctness testing after doing all these tests? Have
> you done the tests on 1000-instance clusters?
>
> It is great that you have done these tests, and I am hoping the gossiping
> snitch is good. Also, was any Gossip bug fixed after 3.0? Maybe I am seeing
> a bug which has already been fixed.

Re: Deprecating/removing PropertyFileSnitch?

Posted by Sankalp Kohli <ko...@gmail.com>.
I will send the JIRAs for the bug that we thought we had fixed but that still exists.

Have you done any correctness testing after running all these tests? Have you run the tests on 1000-instance clusters?

It is great that you have done these tests, and I am hoping the gossiping snitch is good. Also, were any Gossip bugs fixed after 3.0? Maybe I am seeing a bug that has already been fixed.

> On Oct 22, 2018, at 7:09 PM, J. D. Jordan <je...@gmail.com> wrote:
> 
> Do you have a specific gossip bug that you have seen recently which caused a problem that would make this happen?  Do you have a specific JIRA in mind?  “We can’t remove this because what if there is a bug” doesn’t seem like a good enough reason to me. If that was a reason we would never make any changes to anything.
> I think many people have seen PFS actually cause real problems, where with GPFS the issue being talked about is predicated on some theoretical gossip bug happening.
> In the past year at DataStax we have done a lot of testing on 3.0 and 3.11 around adding nodes, adding DC’s, replacing nodes, replacing racks, and replacing DC’s, all while using GPFS, and as far as I know we have not seen any “lost” rack/DC information during such testing.
> 
> -Jeremiah
> 
>> On Oct 22, 2018, at 5:46 PM, sankalp kohli <ko...@gmail.com> wrote:
>> 
>> We will have similar issues with Gossip but this will create more issues as
>> more things will be relied on Gossip.
>> 
>> I agree PFS should be removed but I dont see how it can be with issues like
>> these or someone proves that it wont cause any issues.
>> 
>> On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <pa...@gmail.com>
>> wrote:
>> 
>>> I can understand keeping PFS for historical/compatibility reasons, but if
>>> gossip is broken I think you will have similar ring view problems during
>>> replace/bootstrap that would still occur with the use of PFS (such as
>>> missing tokens, since those are propagated via gossip), so that doesn't
>>> seem like a strong reason to keep it around.
>>> 
>>> With PFS it's pretty easy to shoot yourself in the foot if you're not
>>> careful enough to have identical files across nodes and updating it when
>>> adding nodes/dcs, so it's seems to be less foolproof than other snitches.
>>> While the rejection of verbs to invalid replicas on trunk could address
>>> concerns raised by Jeremy, this would only happen after the new node joins
>>> the ring, so you would need to re-bootstrap the node and lose all the work
>>> done in the original bootstrap.
>>> 
>>> Perhaps one good reason to use PFS is the ability to easily package it
>>> across multiple nodes, as pointed out by Sean Durity on CASSANDRA-10745
>>> (which is also it's Achilles' heel). To keep this ability, we could make
>>> GPFS compatible with the cassandra-topology.properties file, but reading
>>> only the dc/rack info about the local node.
>>> 
>>> On Mon, Oct 22, 2018 at 16:58, sankalp kohli <ko...@gmail.com>
>>> wrote:
>>> 
>>>> Yes it will happen. I am worried that same way DC or rack info can go
>>>> missing.
>>>> 
>>>> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <pa...@gmail.com>
>>>> wrote:
>>>> 
>>>>>> the new host won’t learn about the host whose status is missing and
>>> the
>>>>> view of this host will be wrong.
>>>>> 
>>>>> Won't this happen even with PropertyFileSnitch as the token(s) for this
>>>>> host will be missing from gossip/system.peers?
>>>>> 
>>>>> On Sat, Oct 20, 2018 at 00:34, Sankalp Kohli <kohlisankalp@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Say you restarted all instances in the cluster and status for some
>>> host
>>>>>> goes missing. Now when you start a host replacement, the new host
>>> won’t
>>>>>> learn about the host whose status is missing and the view of this
>>> host
>>>>> will
>>>>>> be wrong.
>>>>>> 
>>>>>> PS: I will be happy to be proved wrong as I can also start using
>>> Gossip
>>>>>> snitch :)
>>>>>> 
>>>>>>> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <
>>>> jeremy.hanna1234@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> Do you mean to say that during host replacement there may be a time
>>>>> when
>>>>>> the old->new host isn’t fully propagated and therefore wouldn’t yet
>>> be
>>>> in
>>>>>> all system tables?
>>>>>>> 
>>>>>>>> On Oct 17, 2018, at 4:20 PM, sankalp kohli <
>>> kohlisankalp@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>> This is not the case during host replacement correct?
>>>>>>>> 
>>>>>>>> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
>>>>>>>> jeremiah.jordan@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> As long as we are correctly storing such things in the system
>>>> tables
>>>>>> and
>>>>>>>>> reading them out of the system tables when we do not have the
>>>>>> information
>>>>>>>>> from gossip yet, it should not be a problem. (As far as I know
>>> GPFS
>>>>>> does
>>>>>>>>> this, but I have not done extensive code diving or testing to
>>> make
>>>>>> sure all
>>>>>>>>> edge cases are covered there)
>>>>>>>>> 
>>>>>>>>> -Jeremiah
>>>>>>>>> 
>>>>>>>>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <
>>>> kohlisankalp@gmail.com
>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Will GossipingPropertyFileSnitch not be vulnerable to Gossip
>>> bugs
>>>>>> where
>>>>>>>>> we
>>>>>>>>>> lose hostId or some other fields when we restart C* for large
>>>>>>>>>> clusters(~1000 instances)?
>>>>>>>>>> 
>>>>>>>>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jj...@gmail.com>
>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> We should, but the 4.0 features that log/reject verbs to
>>> invalid
>>>>>>>>> replicas
>>>>>>>>>>> solves a lot of the concerns here
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Jeff Jirsa
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <
>>>>>> jeremy.hanna1234@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> We have had PropertyFileSnitch for a long time even though
>>>>>>>>>>> GossipingPropertyFileSnitch is effectively a superset of what
>>> it
>>>>>> offers
>>>>>>>>> and
>>>>>>>>>>> is much less error prone.  There are some unexpected behaviors
>>>> when
>>>>>>>>> things
>>>>>>>>>>> aren’t configured correctly with PFS.  For example, if you
>>>> replace
>>>>>>>>> nodes in
>>>>>>>>>>> one DC and add those nodes to that DCs property files and not
>>> the
>>>>>> other
>>>>>>>>> DCs
>>>>>>>>>>> property files - the resulting problems aren’t very
>>>> straightforward
>>>>>> to
>>>>>>>>>>> troubleshoot.
>>>>>>>>>>>> 
>>>>>>>>>>>> We could try to improve the resilience and fail fast error
>>>>> checking
>>>>>> and
>>>>>>>>>>> error reporting of PFS, but honestly, why wouldn’t we deprecate
>>>> and
>>>>>>>>> remove
>>>>>>>>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t be
>>>>>> sufficient
>>>>>>>>> to
>>>>>>>>>>> replace it?
>>>>>>>>>>>> 


Re: Deprecating/removing PropertyFileSnitch?

Posted by "J. D. Jordan" <je...@gmail.com>.
Do you have a specific gossip bug that you have seen recently which caused a problem that would make this happen?  Do you have a specific JIRA in mind?  “We can’t remove this because what if there is a bug” doesn’t seem like a good enough reason to me. If that were a reason we would never make any changes to anything.
I think many people have seen PFS actually cause real problems, whereas with GPFS the issue being talked about is predicated on some theoretical gossip bug happening.
In the past year at DataStax we have done a lot of testing on 3.0 and 3.11 around adding nodes, adding DCs, replacing nodes, replacing racks, and replacing DCs, all while using GPFS, and as far as I know we have not seen any “lost” rack/DC information during such testing.

-Jeremiah

> On Oct 22, 2018, at 5:46 PM, sankalp kohli <ko...@gmail.com> wrote:
> 
> We will have similar issues with Gossip but this will create more issues as
> more things will be relied on Gossip.
> 
> I agree PFS should be removed but I dont see how it can be with issues like
> these or someone proves that it wont cause any issues.
> 
> On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <pa...@gmail.com>
> wrote:
> 
>> I can understand keeping PFS for historical/compatibility reasons, but if
>> gossip is broken I think you will have similar ring view problems during
>> replace/bootstrap that would still occur with the use of PFS (such as
>> missing tokens, since those are propagated via gossip), so that doesn't
>> seem like a strong reason to keep it around.
>> 
>> With PFS it's pretty easy to shoot yourself in the foot if you're not
>> careful enough to have identical files across nodes and updating it when
>> adding nodes/dcs, so it's seems to be less foolproof than other snitches.
>> While the rejection of verbs to invalid replicas on trunk could address
>> concerns raised by Jeremy, this would only happen after the new node joins
>> the ring, so you would need to re-bootstrap the node and lose all the work
>> done in the original bootstrap.
>> 
>> Perhaps one good reason to use PFS is the ability to easily package it
>> across multiple nodes, as pointed out by Sean Durity on CASSANDRA-10745
>> (which is also it's Achilles' heel). To keep this ability, we could make
>> GPFS compatible with the cassandra-topology.properties file, but reading
>> only the dc/rack info about the local node.
>> 
>> On Mon, Oct 22, 2018 at 16:58, sankalp kohli <ko...@gmail.com>
>> wrote:
>> 
>>> Yes it will happen. I am worried that same way DC or rack info can go
>>> missing.
>>> 
>>> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <pa...@gmail.com>
>>> wrote:
>>> 
>>>>> the new host won’t learn about the host whose status is missing and
>> the
>>>> view of this host will be wrong.
>>>> 
>>>> Won't this happen even with PropertyFileSnitch as the token(s) for this
>>>> host will be missing from gossip/system.peers?
>>>> 
>>>> On Sat, Oct 20, 2018 at 00:34, Sankalp Kohli <kohlisankalp@gmail.com>
>>>> wrote:
>>>> 
>>>>> Say you restarted all instances in the cluster and status for some
>> host
>>>>> goes missing. Now when you start a host replacement, the new host
>> won’t
>>>>> learn about the host whose status is missing and the view of this
>> host
>>>> will
>>>>> be wrong.
>>>>> 
>>>>> PS: I will be happy to be proved wrong as I can also start using
>> Gossip
>>>>> snitch :)
>>>>> 
>>>>>> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <
>>> jeremy.hanna1234@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Do you mean to say that during host replacement there may be a time
>>>> when
>>>>> the old->new host isn’t fully propagated and therefore wouldn’t yet
>> be
>>> in
>>>>> all system tables?
>>>>>> 
>>>>>>> On Oct 17, 2018, at 4:20 PM, sankalp kohli <
>> kohlisankalp@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>> This is not the case during host replacement correct?
>>>>>>> 
>>>>>>> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
>>>>>>> jeremiah.jordan@gmail.com> wrote:
>>>>>>> 
>>>>>>>> As long as we are correctly storing such things in the system
>>> tables
>>>>> and
>>>>>>>> reading them out of the system tables when we do not have the
>>>>> information
>>>>>>>> from gossip yet, it should not be a problem. (As far as I know
>> GPFS
>>>>> does
>>>>>>>> this, but I have not done extensive code diving or testing to
>> make
>>>>> sure all
>>>>>>>> edge cases are covered there)
>>>>>>>> 
>>>>>>>> -Jeremiah
>>>>>>>> 
>>>>>>>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <
>>> kohlisankalp@gmail.com
>>>>> 
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Will GossipingPropertyFileSnitch not be vulnerable to Gossip
>> bugs
>>>>> where
>>>>>>>> we
>>>>>>>>> lose hostId or some other fields when we restart C* for large
>>>>>>>>> clusters(~1000 instances)?
>>>>>>>>> 
>>>>>>>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jj...@gmail.com>
>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> We should, but the 4.0 features that log/reject verbs to
>> invalid
>>>>>>>> replicas
>>>>>>>>>> solves a lot of the concerns here
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Jeff Jirsa
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <
>>>>> jeremy.hanna1234@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> We have had PropertyFileSnitch for a long time even though
>>>>>>>>>> GossipingPropertyFileSnitch is effectively a superset of what
>> it
>>>>> offers
>>>>>>>> and
>>>>>>>>>> is much less error prone.  There are some unexpected behaviors
>>> when
>>>>>>>> things
>>>>>>>>>> aren’t configured correctly with PFS.  For example, if you
>>> replace
>>>>>>>> nodes in
>>>>>>>>>> one DC and add those nodes to that DCs property files and not
>> the
>>>>> other
>>>>>>>> DCs
>>>>>>>>>> property files - the resulting problems aren’t very
>>> straightforward
>>>>> to
>>>>>>>>>> troubleshoot.
>>>>>>>>>>> 
>>>>>>>>>>> We could try to improve the resilience and fail fast error
>>>> checking
>>>>> and
>>>>>>>>>> error reporting of PFS, but honestly, why wouldn’t we deprecate
>>> and
>>>>>>>> remove
>>>>>>>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t be
>>>>> sufficient
>>>>>>>> to
>>>>>>>>>> replace it?
>>>>>>>>>>> 


Re: Deprecating/removing PropertyFileSnitch?

Posted by sankalp kohli <ko...@gmail.com>.
We will have similar issues with Gossip, but this will create more issues as
more things will rely on Gossip.

I agree PFS should be removed, but I don't see how it can be with issues like
these, unless someone proves that it won't cause any issues.

On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <pa...@gmail.com>
wrote:

> I can understand keeping PFS for historical/compatibility reasons, but if
> gossip is broken I think you will have similar ring view problems during
> replace/bootstrap that would still occur with the use of PFS (such as
> missing tokens, since those are propagated via gossip), so that doesn't
> seem like a strong reason to keep it around.
>
> With PFS it's pretty easy to shoot yourself in the foot if you're not
> careful enough to have identical files across nodes and updating it when
> adding nodes/dcs, so it's seems to be less foolproof than other snitches.
> While the rejection of verbs to invalid replicas on trunk could address
> concerns raised by Jeremy, this would only happen after the new node joins
> the ring, so you would need to re-bootstrap the node and lose all the work
> done in the original bootstrap.
>
> Perhaps one good reason to use PFS is the ability to easily package it
> across multiple nodes, as pointed out by Sean Durity on CASSANDRA-10745
> (which is also it's Achilles' heel). To keep this ability, we could make
> GPFS compatible with the cassandra-topology.properties file, but reading
> only the dc/rack info about the local node.
>
> On Mon, Oct 22, 2018 at 16:58, sankalp kohli <ko...@gmail.com>
> wrote:
>
> > Yes it will happen. I am worried that same way DC or rack info can go
> > missing.
> >
> > On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <pa...@gmail.com>
> > wrote:
> >
> > > > the new host won’t learn about the host whose status is missing and
> the
> > > view of this host will be wrong.
> > >
> > > Won't this happen even with PropertyFileSnitch as the token(s) for this
> > > host will be missing from gossip/system.peers?
> > >
> > > On Sat, Oct 20, 2018 at 00:34, Sankalp Kohli <kohlisankalp@gmail.com>
> > > wrote:
> > >
> > > > Say you restarted all instances in the cluster and status for some
> host
> > > > goes missing. Now when you start a host replacement, the new host
> won’t
> > > > learn about the host whose status is missing and the view of this
> host
> > > will
> > > > be wrong.
> > > >
> > > > PS: I will be happy to be proved wrong as I can also start using
> Gossip
> > > > snitch :)
> > > >
> > > > > On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <
> > jeremy.hanna1234@gmail.com>
> > > > wrote:
> > > > >
> > > > > Do you mean to say that during host replacement there may be a time
> > > when
> > > > the old->new host isn’t fully propagated and therefore wouldn’t yet
> be
> > in
> > > > all system tables?
> > > > >
> > > > >> On Oct 17, 2018, at 4:20 PM, sankalp kohli <
> kohlisankalp@gmail.com>
> > > > wrote:
> > > > >>
> > > > >> This is not the case during host replacement correct?
> > > > >>
> > > > >> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
> > > > >> jeremiah.jordan@gmail.com> wrote:
> > > > >>
> > > > >>> As long as we are correctly storing such things in the system
> > tables
> > > > and
> > > > >>> reading them out of the system tables when we do not have the
> > > > information
> > > > >>> from gossip yet, it should not be a problem. (As far as I know
> GPFS
> > > > does
> > > > >>> this, but I have not done extensive code diving or testing to
> make
> > > > sure all
> > > > >>> edge cases are covered there)
> > > > >>>
> > > > >>> -Jeremiah
> > > > >>>
> > > > >>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <
> > kohlisankalp@gmail.com
> > > >
> > > > >>> wrote:
> > > > >>>>
> > > > >>>> Will GossipingPropertyFileSnitch not be vulnerable to Gossip
> bugs
> > > > where
> > > > >>> we
> > > > >>>> lose hostId or some other fields when we restart C* for large
> > > > >>>> clusters(~1000 instances)?
> > > > >>>>
> > > > >>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jj...@gmail.com>
> > > wrote:
> > > > >>>>>
> > > > >>>>> We should, but the 4.0 features that log/reject verbs to
> invalid
> > > > >>> replicas
> > > > >>>>> solves a lot of the concerns here
> > > > >>>>>
> > > > >>>>> --
> > > > >>>>> Jeff Jirsa
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <
> > > > jeremy.hanna1234@gmail.com>
> > > > >>>>> wrote:
> > > > >>>>>>
> > > > >>>>>> We have had PropertyFileSnitch for a long time even though
> > > > >>>>> GossipingPropertyFileSnitch is effectively a superset of what
> it
> > > > offers
> > > > >>> and
> > > > >>>>> is much less error prone.  There are some unexpected behaviors
> > when
> > > > >>> things
> > > > >>>>> aren’t configured correctly with PFS.  For example, if you
> > replace
> > > > >>> nodes in
> > > > >>>>> one DC and add those nodes to that DCs property files and not
> the
> > > > other
> > > > >>> DCs
> > > > >>>>> property files - the resulting problems aren’t very
> > straightforward
> > > > to
> > > > >>>>> troubleshoot.
> > > > >>>>>>
> > > > >>>>>> We could try to improve the resilience and fail fast error
> > > checking
> > > > and
> > > > >>>>> error reporting of PFS, but honestly, why wouldn’t we deprecate
> > and
> > > > >>> remove
> > > > >>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t be
> > > > sufficient
> > > > >>> to
> > > > >>>>> replace it?
> > > > >>>>>>

Re: Deprecating/removing PropertyFileSnitch?

Posted by Paulo Motta <pa...@gmail.com>.
I can understand keeping PFS for historical/compatibility reasons, but if
gossip is broken I think you will have similar ring view problems during
replace/bootstrap that would still occur with the use of PFS (such as
missing tokens, since those are propagated via gossip), so that doesn't
seem like a strong reason to keep it around.

With PFS it's pretty easy to shoot yourself in the foot if you're not
careful enough to keep identical files across nodes and to update them when
adding nodes/DCs, so it seems to be less foolproof than other snitches.
While the rejection of verbs to invalid replicas on trunk could address
concerns raised by Jeremy, this would only happen after the new node joins
the ring, so you would need to re-bootstrap the node and lose all the work
done in the original bootstrap.
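
To illustrate the "identical files across nodes" requirement, here is a
minimal sketch of a consistency check. It assumes the usual
cassandra-topology.properties line format of address=DC:RACK; the function
names and the idea of collecting every node's file contents into one dict
are hypothetical and not part of Cassandra.

```python
def parse_topology(text):
    """Parse 'address=DC:RACK' lines into a dict, skipping blanks and # comments."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


def topology_mismatches(files_by_node):
    """Return {address: {node: value_or_None}} for every address whose entry
    is missing from, or different in, at least one node's file.

    files_by_node is a hypothetical input: node name -> file contents.
    """
    parsed = {node: parse_topology(text) for node, text in files_by_node.items()}
    addresses = set()
    for entries in parsed.values():
        addresses.update(entries)
    mismatches = {}
    for address in sorted(addresses):
        seen = {node: entries.get(address) for node, entries in parsed.items()}
        if len(set(seen.values())) > 1:
            mismatches[address] = seen
    return mismatches
```

An empty result means every node's file lists the same addresses with the
same DC and rack. Any non-empty result is the kind of drift described above.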

Perhaps one good reason to use PFS is the ability to easily package it
across multiple nodes, as pointed out by Sean Durity on CASSANDRA-10745
(which is also its Achilles' heel). To keep this ability, we could make
GPFS compatible with the cassandra-topology.properties file, but have it read
only the dc/rack info about the local node.
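
A rough sketch of what "reading only the local node's entry" could look
like, assuming the address=DC:RACK line format of
cassandra-topology.properties. The function name is hypothetical, and
falling back to the file's "default" entry is an assumption made for
illustration, not something decided in this thread.

```python
def local_dc_rack(properties_text, local_address):
    """Return (dc, rack) for local_address, ignoring every other node's entry."""
    entries = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()

    # Assumption: use the "default" entry when the local address is not listed.
    value = entries.get(local_address, entries.get("default"))
    if value is None:
        raise ValueError("no entry for %s and no default" % local_address)

    dc, sep, rack = value.partition(":")
    if not sep:
        raise ValueError("expected DC:RACK, got %r" % value)
    return dc.strip(), rack.strip()
```

With this approach, stale or missing entries for other nodes in the
packaged file would not matter, since each node would announce only its own
dc/rack and learn the rest from gossip.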

On Mon, Oct 22, 2018 at 16:58, sankalp kohli <ko...@gmail.com>
wrote:

> Yes it will happen. I am worried that same way DC or rack info can go
> missing.
>
> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <pa...@gmail.com>
> wrote:
>
> > > the new host won’t learn about the host whose status is missing and the
> > view of this host will be wrong.
> >
> > Won't this happen even with PropertyFileSnitch as the token(s) for this
> > host will be missing from gossip/system.peers?
> >
> > On Sat, Oct 20, 2018 at 00:34, Sankalp Kohli <kohlisankalp@gmail.com>
> > wrote:
> >
> > > Say you restarted all instances in the cluster and status for some host
> > > goes missing. Now when you start a host replacement, the new host won’t
> > > learn about the host whose status is missing and the view of this host
> > will
> > > be wrong.
> > >
> > > PS: I will be happy to be proved wrong as I can also start using Gossip
> > > snitch :)
> > >
> > > > On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <
> jeremy.hanna1234@gmail.com>
> > > wrote:
> > > >
> > > > Do you mean to say that during host replacement there may be a time
> > when
> > > the old->new host isn’t fully propagated and therefore wouldn’t yet be
> in
> > > all system tables?
> > > >
> > > >> On Oct 17, 2018, at 4:20 PM, sankalp kohli <ko...@gmail.com>
> > > wrote:
> > > >>
> > > >> This is not the case during host replacement correct?
> > > >>
> > > >> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
> > > >> jeremiah.jordan@gmail.com> wrote:
> > > >>
> > > >>> As long as we are correctly storing such things in the system
> tables
> > > and
> > > >>> reading them out of the system tables when we do not have the
> > > information
> > > >>> from gossip yet, it should not be a problem. (As far as I know GPFS
> > > does
> > > >>> this, but I have not done extensive code diving or testing to make
> > > sure all
> > > >>> edge cases are covered there)
> > > >>>
> > > >>> -Jeremiah
> > > >>>
> > > >>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <
> kohlisankalp@gmail.com
> > >
> > > >>> wrote:
> > > >>>>
> > > >>>> Will GossipingPropertyFileSnitch not be vulnerable to Gossip bugs
> > > where
> > > >>> we
> > > >>>> lose hostId or some other fields when we restart C* for large
> > > >>>> clusters(~1000 instances)?
> > > >>>>
> > > >>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jj...@gmail.com>
> > wrote:
> > > >>>>>
> > > >>>>> We should, but the 4.0 features that log/reject verbs to invalid
> > > >>> replicas
> > > >>>>> solves a lot of the concerns here
> > > >>>>>
> > > >>>>> --
> > > >>>>> Jeff Jirsa
> > > >>>>>
> > > >>>>>
> > > >>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <
> > > jeremy.hanna1234@gmail.com>
> > > >>>>> wrote:
> > > >>>>>>
> > > >>>>>> We have had PropertyFileSnitch for a long time even though
> > > >>>>> GossipingPropertyFileSnitch is effectively a superset of what it
> > > offers
> > > >>> and
> > > >>>>> is much less error prone.  There are some unexpected behaviors
> when
> > > >>> things
> > > >>>>> aren’t configured correctly with PFS.  For example, if you
> replace
> > > >>> nodes in
> > > >>>>> one DC and add those nodes to that DCs property files and not the
> > > other
> > > >>> DCs
> > > >>>>> property files - the resulting problems aren’t very
> straightforward
> > > to
> > > >>>>> troubleshoot.
> > > >>>>>>
> > > >>>>>> We could try to improve the resilience and fail fast error
> > checking
> > > and
> > > >>>>> error reporting of PFS, but honestly, why wouldn’t we deprecate
> and
> > > >>> remove
> > > >>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t be
> > > sufficient
> > > >>> to
> > > >>>>> replace it?
> > > >>>>>>

Re: Deprecating/removing PropertyFileSnitch?

Posted by sankalp kohli <ko...@gmail.com>.
Yes, it will happen. I am worried that DC or rack info can go missing in the
same way.

On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <pa...@gmail.com>
wrote:

> > the new host won’t learn about the host whose status is missing and the
> view of this host will be wrong.
>
> Won't this happen even with PropertyFileSnitch as the token(s) for this
> host will be missing from gossip/system.peers?
>
> On Sat, Oct 20, 2018 at 00:34, Sankalp Kohli <ko...@gmail.com>
> wrote:
>
> > Say you restarted all instances in the cluster and status for some host
> > goes missing. Now when you start a host replacement, the new host won’t
> > learn about the host whose status is missing and the view of this host
> will
> > be wrong.
> >
> > PS: I will be happy to be proved wrong as I can also start using Gossip
> > snitch :)
> >
> > > On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <je...@gmail.com>
> > wrote:
> > >
> > > Do you mean to say that during host replacement there may be a time
> when
> > the old->new host isn’t fully propagated and therefore wouldn’t yet be in
> > all system tables?
> > >
> > >> On Oct 17, 2018, at 4:20 PM, sankalp kohli <ko...@gmail.com>
> > wrote:
> > >>
> > >> This is not the case during host replacement correct?
> > >>
> > >> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
> > >> jeremiah.jordan@gmail.com> wrote:
> > >>
> > >>> As long as we are correctly storing such things in the system tables
> > and
> > >>> reading them out of the system tables when we do not have the
> > information
> > >>> from gossip yet, it should not be a problem. (As far as I know GPFS
> > does
> > >>> this, but I have not done extensive code diving or testing to make
> > sure all
> > >>> edge cases are covered there)
> > >>>
> > >>> -Jeremiah
> > >>>
> > >>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <kohlisankalp@gmail.com
> >
> > >>> wrote:
> > >>>>
> > >>>> Will GossipingPropertyFileSnitch not be vulnerable to Gossip bugs
> > where
> > >>> we
> > >>>> lose hostId or some other fields when we restart C* for large
> > >>>> clusters(~1000 instances)?
> > >>>>
> > >>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jj...@gmail.com>
> wrote:
> > >>>>>
> > >>>>> We should, but the 4.0 features that log/reject verbs to invalid
> > >>> replicas
> > >>>>> solves a lot of the concerns here
> > >>>>>
> > >>>>> --
> > >>>>> Jeff Jirsa
> > >>>>>
> > >>>>>
> > >>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <
> > jeremy.hanna1234@gmail.com>
> > >>>>> wrote:
> > >>>>>>
> > >>>>>> We have had PropertyFileSnitch for a long time even though
> > >>>>> GossipingPropertyFileSnitch is effectively a superset of what it
> > offers
> > >>> and
> > >>>>> is much less error prone.  There are some unexpected behaviors when
> > >>> things
> > >>>>> aren’t configured correctly with PFS.  For example, if you replace
> > >>> nodes in
> > >>>>> one DC and add those nodes to that DCs property files and not the
> > other
> > >>> DCs
> > >>>>> property files - the resulting problems aren’t very straightforward
> > to
> > >>>>> troubleshoot.
> > >>>>>>
> > >>>>>> We could try to improve the resilience and fail fast error
> checking
> > and
> > >>>>> error reporting of PFS, but honestly, why wouldn’t we deprecate and
> > >>> remove
> > >>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t be
> > sufficient
> > >>> to
> > >>>>> replace it?
> > >>>>>>
> > ---------------------------------------------------------------------
> > >>>>>> To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > >>>>>> For additional commands, e-mail: dev-help@cassandra.apache.org
> > >>>>>>
> > >>>>>
> > >>>>>
> ---------------------------------------------------------------------
> > >>>>> To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > >>>>> For additional commands, e-mail: dev-help@cassandra.apache.org
> > >>>>>
> > >>>>>
> > >>>
> > >>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > >>> For additional commands, e-mail: dev-help@cassandra.apache.org
> > >>>
> > >>>
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > For additional commands, e-mail: dev-help@cassandra.apache.org
> >
> >
>

Re: Deprecating/removing PropertyFileSnitch?

Posted by Paulo Motta <pa...@gmail.com>.
> the new host won’t learn about the host whose status is missing and the
> view of this host will be wrong.

Won't this happen even with PropertyFileSnitch, since the token(s) for this
host will be missing from gossip/system.peers?

On Sat, Oct 20, 2018 at 00:34, Sankalp Kohli <ko...@gmail.com>
wrote:

> Say you restarted all instances in the cluster and status for some host
> goes missing. Now when you start a host replacement, the new host won’t
> learn about the host whose status is missing and the view of this host will
> be wrong.
>
> PS: I will be happy to be proved wrong as I can also start using Gossip
> snitch :)

Re: Deprecating/removing PropertyFileSnitch?

Posted by Sankalp Kohli <ko...@gmail.com>.
Say you restart all instances in the cluster and the status for some host goes missing. When you then start a host replacement, the new host won’t learn about the host whose status is missing, and the new host’s view of the cluster will be wrong.

PS: I would be happy to be proved wrong, as then I could also start using the gossiping snitch :)

> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <je...@gmail.com> wrote:
> 
> Do you mean to say that during host replacement there may be a time when the old->new host isn’t fully propagated and therefore wouldn’t yet be in all system tables?


Re: Deprecating/removing PropertyFileSnitch?

Posted by Jeremy Hanna <je...@gmail.com>.
Do you mean that during host replacement there may be a window when the old-to-new host change isn’t fully propagated and therefore isn’t yet in all system tables?

> On Oct 17, 2018, at 4:20 PM, sankalp kohli <ko...@gmail.com> wrote:
> 
> This is not the case during host replacement correct?


Re: Deprecating/removing PropertyFileSnitch?

Posted by sankalp kohli <ko...@gmail.com>.
This is not the case during host replacement, correct?

On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
jeremiah.jordan@gmail.com> wrote:

> As long as we are correctly storing such things in the system tables and
> reading them out of the system tables when we do not have the information
> from gossip yet, it should not be a problem. (As far as I know GPFS does
> this, but I have not done extensive code diving or testing to make sure all
> edge cases are covered there)
>
> -Jeremiah

Re: Deprecating/removing PropertyFileSnitch?

Posted by Jeremiah D Jordan <je...@gmail.com>.
As long as we are correctly storing such things in the system tables and reading them back out of the system tables when we do not yet have the information from gossip, it should not be a problem. (As far as I know, GPFS does this, but I have not done extensive code diving or testing to make sure all edge cases are covered there.)
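
To make the lookup order concrete, here is a rough sketch of the fallback described above: prefer live gossip state, fall back to what was saved in the system tables, and only then use a placeholder. This is illustrative Python, not Cassandra's actual snitch code; `resolve_location`, the dictionary shapes, and the `UNKNOWN_DC`/`UNKNOWN_RACK` placeholder are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

Location = Tuple[str, str]  # (datacenter, rack)

# Hypothetical placeholder returned when nothing is known about an endpoint.
DEFAULT_LOCATION: Location = ("UNKNOWN_DC", "UNKNOWN_RACK")


def resolve_location(
    endpoint: str,
    gossip_state: Dict[str, Location],
    saved_peers: Dict[str, Location],
    default: Location = DEFAULT_LOCATION,
) -> Location:
    """Return (dc, rack) for an endpoint, preferring the freshest source."""
    # 1. Live gossip state wins when it is present.
    location: Optional[Location] = gossip_state.get(endpoint)
    if location is not None:
        return location
    # 2. Otherwise fall back to what was persisted (a stand-in for system.peers).
    location = saved_peers.get(endpoint)
    if location is not None:
        return location
    # 3. Nothing known: return the placeholder so the caller can detect it.
    return default
```

In this sketch an endpoint that is absent from gossip right after a restart still resolves to its saved DC and rack. An endpoint missing from both sources falls through to the placeholder, which is the kind of edge case that would need testing.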

-Jeremiah

> On Oct 16, 2018, at 11:56 AM, sankalp kohli <ko...@gmail.com> wrote:
> 
> Will GossipingPropertyFileSnitch not be vulnerable to Gossip bugs where we
> lose hostId or some other fields when we restart C* for large
> clusters(~1000 instances)?


Re: Deprecating/removing PropertyFileSnitch?

Posted by sankalp kohli <ko...@gmail.com>.
Won't GossipingPropertyFileSnitch be vulnerable to gossip bugs where we
lose hostId or some other fields when we restart C* on large
clusters (~1000 instances)?

On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jj...@gmail.com> wrote:

> We should, but the 4.0 features that log/reject verbs to invalid replicas
> solves a lot of the concerns here
>
> --
> Jeff Jirsa

Re: Deprecating/removing PropertyFileSnitch?

Posted by Jeff Jirsa <jj...@gmail.com>.
We should, but the 4.0 features that log/reject verbs to invalid replicas solve a lot of the concerns here.

-- 
Jeff Jirsa


> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <je...@gmail.com> wrote:
> 
> We have had PropertyFileSnitch for a long time even though GossipingPropertyFileSnitch is effectively a superset of what it offers and is much less error prone.  There are some unexpected behaviors when things aren’t configured correctly with PFS.  For example, if you replace nodes in one DC and add those nodes to that DCs property files and not the other DCs property files - the resulting problems aren’t very straightforward to troubleshoot.
> 
> We could try to improve the resilience and fail fast error checking and error reporting of PFS, but honestly, why wouldn’t we deprecate and remove PropertyFileSnitch?  Are there reasons why GPFS wouldn’t be sufficient to replace it?