Posted to dev@solr.apache.org by Mark Miller <ma...@gmail.com> on 2021/09/22 14:39:00 UTC

PerReplicaStatesIntegrationTest

Perhaps I just have a unique test running experience, but this test has
been an outlier failure in my test runs for months. Given that it’s
newer than most tests, I imagine its attention-grabbing days are on a
downslope, so here is a poke if someone wants to check out why it often
can’t find the collection it’s currently supposed to be creating.


-- 
- Mark

http://about.me/markrmiller

Re: PerReplicaStatesIntegrationTest

Posted by Mark Miller <ma...@gmail.com>.
> I would hope there are few developers doing cloud work that don’t
> understand the lazy local cluster state - it’s entirely fundamental to
> everything.

The busy waiting I would be less surprised if someone didn’t understand,
but as far as I’m concerned those are bugs too. It’s an event-driven
system; the APIs exist to properly remove every one of those busy waits.
Technically they should be forbidden. You won’t find them in any of my
branches.
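
For illustration, here is a minimal sketch of the event-driven alternative
using the raw ZooKeeper client - the class, path, and method names are
hypothetical, not Solr’s actual code:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class WatchWait {
      // Block until 'path' exists, driven by a watch event rather than
      // a sleep/poll loop.
      static boolean awaitNode(ZooKeeper zk, String path, long timeoutMs)
          throws Exception {
        CountDownLatch latch = new CountDownLatch(1);
        Watcher watcher = (WatchedEvent event) -> {
          if (event.getType() == Watcher.Event.EventType.NodeCreated) {
            latch.countDown();
          }
        };
        // exists() checks the current state and registers the watch in
        // one call, so there is no window where a creation can be missed.
        if (zk.exists(path, watcher) != null) {
          return true; // already there
        }
        return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
      }
    }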

I will say, I don’t know if it’s a tease or a comfort that someone can at
least articulate most of the main things that are and have been wrong. If
I had that, plus someone who saw that correctly fixing the majority of the
issues is relatively easy, and the time and inclination to actually do
some of it, I’d have something of a path toward ref-branch behavior that
didn’t weigh me down.

Once you actually fix all these silly issues, transforming the system
becomes relatively simple instead of a quagmire.


Mark
-- 
- Mark

http://about.me/markrmiller

Re: PerReplicaStatesIntegrationTest

Posted by Mark Miller <ma...@gmail.com>.
David’s issue and my response refer to the number of ZK servers in the ZK
cluster. His issue requires more than one ZK server; the tests have always
used one.

Yes, the whole system is supposed to work fine with a stale local cache of
what’s in ZK. That is the design. When that doesn’t work, that is a bug,
due either to a developer not understanding the design or to incorrect
code.

When a test fails because of a bug, it’s not uncommon for the developers of
the test to throw up their hands and say the test is correct, it’s not my
bug.

It’s not my bug. But I’ll commit the fix.

Mark

- Mark

http://about.me/markrmiller

Re: PerReplicaStatesIntegrationTest

Posted by Ilan Ginzburg <il...@gmail.com>.
I don't know the fix for this specific test, but the way cluster state
is maintained on a node does not depend on how many ZK nodes there are.

When a node does an action against ZK, it does its write to ZK.
When it needs to read, it reads from its local cache.
The local cache of the node is updated (following a write) by the ZK watch
firing, followed by the callback reading the changed bits. So it's always
some time after the write has completed.

One therefore can't assume a change made is immediately visible on a node,
no matter the ZK config. That's why code often busy-waits for the update to
become visible before continuing (common pattern in the Collection API
commands).
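
As a sketch, that busy-wait pattern reduces to something like the generic
poll loop below (illustrative names, not the actual Collection API code):

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import java.util.function.BooleanSupplier;

    public class BusyWait {
      // Poll until 'visible' reports true or the timeout elapses. The
      // write to ZK has already succeeded; the local cache only catches
      // up after the watch fires and the callback re-reads the state.
      static void waitUntilVisible(BooleanSupplier visible, long timeoutMs)
          throws InterruptedException, TimeoutException {
        long deadline =
            System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!visible.getAsBoolean()) {
          if (System.nanoTime() > deadline) {
            throw new TimeoutException("change never became visible locally");
          }
          Thread.sleep(100); // let the watch callback update the cache
        }
      }
    }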

Ilan

Re: PerReplicaStatesIntegrationTest

Posted by Mark Miller <ma...@gmail.com>.
Okay, never mind. Somehow I cling to this idea that it’s easier not to get
drawn into every test or feature that’s causing me problems, but I should
have known the 30 seconds it takes to address most of these things will
easily be dwarfed by the theoretical back and forth over them. I’ll put in
the fix for it.

Mark

-- 
- Mark

http://about.me/markrmiller

Re: PerReplicaStatesIntegrationTest

Posted by Mark Miller <ma...@gmail.com>.
I should also mention, I promise this test can be 100% reliable. It’s not
code I’m going to ramp up on soon though. Also, as I said, I may have a
different test experience than others. What tests run together and how
things run will depend on hardware, core count, etc. It’s just the most
common failure I see, and given it’s a new test, it tends to be easier to
get attention on than an old test.

The issues themselves could be a test problem, a real problem, or a real
problem that’s not likely to be seen in production. They run the gamut. At
the moment, all I know is that it’s the test that most often forces me to
rerun my test runs.

Mark

-- 
- Mark

http://about.me/markrmiller

Re: PerReplicaStatesIntegrationTest

Posted by Mark Miller <ma...@gmail.com>.
I believe all tests still run with a one-node ZK cluster; if that’s still
the case, ZK consistency shouldn’t matter.

It’s been a long while since I’ve looked into that particular doc/issue,
but even with more than one ZK instance, I believe that is only an issue
in a fairly specific case - when a client does something with ZK, assumes
it’s done, and then triggers something else with the assumption the change
is made. That something else may not see the change, though normally this
would require that it’s using a different ZK client instance.
Unfortunately, we don’t always currently use a single ZK client per node,
but even still, this is not a normal pattern. Most Solr ZK usage should
not have an issue with this case, as most behavior is driven directly by
notifications from ZooKeeper or does not trigger something else with this
assumption.
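
A minimal sketch of that specific case, using two separate ZooKeeper
client sessions (the path is hypothetical, for illustration only):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class TwoClients {
      static void demo(ZooKeeper clientA, ZooKeeper clientB)
          throws Exception {
        // Client A writes and gets an acknowledgement from its ZK server.
        clientA.create("/flags/done", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Client B may be connected to a different, slightly lagging ZK
        // server. ZooKeeper does not guarantee simultaneously consistent
        // cross-client views, so B is not promised to see the node yet.
        Stat stat = clientB.exists("/flags/done", false);
        // stat can legitimately be null until B's server catches up.
      }
    }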

Mark

- Mark

http://about.me/markrmiller

Re: PerReplicaStatesIntegrationTest

Posted by David Smiley <ds...@apache.org>.
This drives me crazy too.

+1 to Ilan's point.  For a CloudSolrClient, its state knowledge should
merely be a hint and not the final word -- need to go to ZK for that.  For
the HTTP based ClusterStateProvider, the receiving Solr side needs to use
non-cached information -- must go to ZK always (maybe toggle-able with a
param if need be).

Still, here's a public service announcement on a guarantee that ZooKeeper
does *not* have:
https://zookeeper.apache.org/doc/r3.5.9/zookeeperProgrammers.html#ch_zkGuarantees
see lack of "Simultaneously Consistent Cross-Client Views" in the note.
After reading this (and being shocked by its implications), I added
https://github.com/apache/solr/blob/122c88a0748769432ef62cc3fb94c2226dd67aa7/solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L2071
And I also tried to highlight this... seems maybe not the dev list (I can't
find it now) but at least in JIRA somewhere.
So maybe all ClusterStateProviders need to ask that a Zk "sync" is called
to guarantee the view is up-to-date?  I'm not sure what the cost is but it
may be a cost we can't safely avoid.
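
A minimal sketch of what that could look like with the raw client -
sync() is asynchronous, so it has to be bridged with a latch (illustrative
only, not the ZkStateReader code linked above):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import org.apache.zookeeper.ZooKeeper;

    public class SyncThenRead {
      static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        CountDownLatch latch = new CountDownLatch(1);
        // Ask the server this session is connected to to catch up with
        // the ZK leader before we read.
        zk.sync(path, (rc, p, ctx) -> latch.countDown(), null);
        if (!latch.await(10, TimeUnit.SECONDS)) {
          throw new IllegalStateException("sync timed out");
        }
        // This read now reflects writes committed before the sync.
        return zk.getData(path, false, null);
      }
    }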

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

Re: PerReplicaStatesIntegrationTest

Posted by Ilan Ginzburg <il...@gmail.com>.
Not sure, Gus, that I would blame the create-collection code. To the best
of my recollection, when the create collection call returns, the
collection IS fully created.
This doesn't mean though (and that's the problem IMO) that the cluster
state on the node that issued the collection creation call is aware of it:
its cache of cluster state is updated asynchronously, at a later point,
once Zookeeper watches decide it's time.
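
A minimal sketch of that cycle, with hypothetical names (not Solr's actual
ZkStateReader internals): the watch fires some time after the write
completed, the callback re-reads the changed bits, and only then does the
locally cached state change.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    class CachedState implements Watcher {
      final ZooKeeper zk;
      final Map<String, byte[]> cache = new ConcurrentHashMap<>();

      CachedState(ZooKeeper zk) { this.zk = zk; }

      public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
          try {
            // Re-read the changed bits and re-arm the watch ('this').
            cache.put(event.getPath(),
                zk.getData(event.getPath(), this, null));
          } catch (Exception e) {
            // real code would handle connection/session errors here
          }
        }
      }

      // Reads are served from the local cache, which lags the ZK write.
      byte[] read(String path) { return cache.get(path); }
    }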

I would tend to blame the way cluster state is managed in general in the
cluster.

I didn't look at this test specifically, so the actual issue might still be
different.

Ilan

Re: PerReplicaStatesIntegrationTest

Posted by Gus Heck <gu...@gmail.com>.
> why it often can’t find the collection it’s currently supposed to be
> creating


This sounds like the things that pestered us while writing TRA tests. IIRC
the problem basically comes from 2 things: 1) we return from create
collection before the collection is fully created and ready to use, and
2) the watching code to determine when it IS ready is non-trivial. I think
#1 is the real problem and #2 is a bandaid that shouldn't be needed.
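
For reference, the test-side workaround usually looks something like the
sketch below - create, then explicitly wait on the watch-driven state
before using the collection. The SolrJ method names are from memory here,
so treat the details as assumptions:

    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateAndWait {
      static void createAndWait(CloudSolrClient client) throws Exception {
        // Create the collection (hypothetical name and configset).
        CollectionAdminRequest.createCollection("testCol", "conf1", 2, 1)
            .process(client);
        // The create call returning does not guarantee this node's cached
        // cluster state has caught up, so wait on the watch-driven view:
        client.waitForState("testCol", 30, TimeUnit.SECONDS,
            (liveNodes, collectionState) -> collectionState != null);
      }
    }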

I think I recall Mark previously ranting about how insane and terrible it
would be if an RDBMS did this with CREATE TABLE...


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: PerReplicaStatesIntegrationTest

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.
Sure, Mark.
Noble or I will get to this at the earliest, hopefully by end of this week.
Unfortunately, I haven't been paying attention to test failures lately.
