You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Jason Gerlowski (JIRA)" <ji...@apache.org> on 2018/12/12 13:58:00 UTC

[jira] [Comment Edited] (SOLR-13037) Harden TestSimGenericDistributedQueue.

    [ https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719000#comment-16719000 ] 

Jason Gerlowski edited comment on SOLR-13037 at 12/12/18 1:57 PM:
------------------------------------------------------------------

To (hopefully) explain things a little more clearly, here's the race condition I think we're running into here.  There's a few sections of {{TestSimGenericDistributedQueue}} that seem to fail, but let's zoom in on one in particular.  Check out TestSimDistributedQueue lines 73-74:

{code}
 (new QueueChangerThread(dq,1000)).start();
 assertNotNull(dq.peek(15000));
{code}

This test code has two threads of interest. The QueueChangerThread we see created here will sleep for one second, and then insert data into the queue. Meanwhile the main test thread will wait for some data to be inserted into the queue. Our queue-reading waits a pretty generous amount of time for things to enter the queue, so the insert should always finish in time and the read should always pick it up.

Some more detail on the operation of each queue operation happens. First the queue-write (i.e. {{offer()}}):
 - [Acquire lock 'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L461]
 - [Create queue entry node and attach it to parent|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L324]
 - [Wake up any threads sleeping on the 'changed' Condition|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L593]
 - [Release lock 'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L465]
 - [Set data for queue entry|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L468]

Now the queue-read. Queue-reading works off of a cache of "known queue entries" and most queue-reads are handled from there. But the test failure only occurs when we need to refresh this cache and read straight from ZK, so I'll skip the cache logic here.
 - [Acquire lock 'updateLock'|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L186]
 - [loop until we're out of time to wait:|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L189]
 ** [look for an element and return if non-null|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L190]
 ** [sleep until we receive a wakeup  from 'changed' Condition or we time out.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L194]
 - [Release lock 'updateLock'.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L198]

There's a problem with the queue-write code above.  We wake up threads after creating the queue-entry, but before it's fully initialized with its data.  This opens the door to readers seeing the data before it's fully ready and going back to sleep.  The 'changed' signalling has already happened, so any readers that see the data too early will go back to sleep and not wake up again until timeout.

There's a few ways we can fix this:
- we could add a `changed.signalAll()` call at the end of {{offer()}}, to ensure that there's at least 1 wakeup after the data has been fully added.
- we can alter the flow of SimDistribStateManager.createData so that the node is only attached to the tree after its data has been fully initialized
- we could register a Watcher that triggers on "data-changed", similar to how we already trigger a watcher on "child-added"  
 


was (Author: gerlowskija):
To (hopefully) explain things a little more clearly, here's the race condition I think we're running into here.  There's a few sections of {{TestSimGenericDistributedQueue}} that seem to fail, but let's zoom in on one in particular.  Check out TestSimDistributedQueue lines 73-74:

(code}
 (new QueueChangerThread(dq,1000)).start();
 assertNotNull(dq.peek(15000));
{code}

This test code has two threads of interest. The QueueChangerThread we see created here will sleep for one second, and then insert data into the queue. Meanwhile the main test thread will wait for some data to be inserted into the queue. Our queue-reading waits a pretty generous amount of time for things to enter the queue, so the insert should always finish in time and the read should always pick it up.

Some more detail on the operation of each queue operation happens. First the queue-write (i.e. {{offer()}}):
 - [Acquire lock 'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L461]
 - [Create queue entry node and attach it to parent|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L324]
 - [Wake up any threads sleeping on the 'changed' Condition|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L593]
 - [Release lock 'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L465]
 - [Set data for queue entry|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L468]

Now the queue-read. Queue-reading works off of a cache of "known queue entries" and most queue-reads are handled from there. But the test failure only occurs when we need to refresh this cache and read straight from ZK, so I'll skip the cache logic here.
 - [Acquire lock 'updateLock'|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L186]
 - [loop until we're out of time to wait:|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L189]
 ** [look for an element and return if non-null|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L190]
 ** [sleep until we receive a wakeup  from 'changed' Condition or we time out.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L194]
 - [Release lock 'updateLock'.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L198]

There's a problem with the queue-write code above.  We wake up threads after creating the queue-entry, but before it's fully initialized with its data.  This opens the door to readers seeing the data before it's fully ready and going back to sleep.  The 'changed' signalling has already happened, so any readers that see the data too early will go back to sleep and not wake up again until timeout.

There's a few ways we can fix this:
- we could add a `changed.signalAll()` call at the end of {{offer()}}, to ensure that there's at least 1 wakeup after the data has been fully added.
- we can alter the flow of SimDistribStateManager.createData so that the node is only attached to the tree after its data has been fully initialized
- we could register a Watcher that triggers on "data-changed", similar to how we already trigger a watcher on "child-added"  
 

> Harden TestSimGenericDistributedQueue.
> --------------------------------------
>
>                 Key: SOLR-13037
>                 URL: https://issues.apache.org/jira/browse/SOLR-13037
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Tests
>            Reporter: Mark Miller
>            Assignee: Jason Gerlowski
>            Priority: Major
>         Attachments: repro-log.txt
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org