Posted to issues@geode.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/08/01 21:11:01 UTC

[jira] [Commented] (GEODE-3055) data mismatch caused by rebalance. waitUntilFlushed returns false

    [ https://issues.apache.org/jira/browse/GEODE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109783#comment-16109783 ] 

ASF GitHub Bot commented on GEODE-3055:
---------------------------------------

Github user gesterzhou commented on a diff in the pull request:

    https://github.com/apache/geode/pull/570#discussion_r130729615
  
    --- Diff: geode-core/src/main/java/org/apache/geode/internal/cache/PartitionedRegionDataStore.java ---
    @@ -1472,6 +1472,19 @@ public boolean removeBucket(int bucketId, boolean forceRemovePrimary) {
           }
     
           BucketAdvisor bucketAdvisor = bucketRegion.getBucketAdvisor();
    +      InternalDistributedMember primary = bucketAdvisor.getPrimary();
    +      InternalDistributedMember myId =
    +          this.partitionedRegion.getDistributionManager().getDistributionManagerId();
    +      if (primary == null || myId.equals(primary)) {
    --- End diff --
    
    The forceRemovePrimary parameter WAS useless and could have been removed, because callers always passed "false".
    
    But when I added the logic to remove the leader region bucket (for the case where the shadow bucket fails to initialize), I have to call removeBucket(xxx, true) myself.
    
    The shadow bucket is removed first by the exception handling. But since I added the logic to remove the leader bucket as well, I have to skip a few "return false" exit points: at that moment the leader bucket is not logically ready, so it does not qualify for removal unless it is removed forcibly.
    
    So I make use of the forceRemovePrimary parameter. Maybe I should rename it to something clearer, such as forceToRemove.
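    To make the "skip the exit points" idea concrete, the guard being discussed can be sketched roughly as below. This is illustrative only: BucketRemovalSketch, BucketState, and canRemove are invented names for this sketch, not Geode's actual removeBucket internals.

    ```java
    // Illustrative sketch of a force-remove guard; names are hypothetical,
    // not the real PartitionedRegionDataStore API.
    class BucketRemovalSketch {
      enum BucketState { INITIALIZING, READY }

      // Without the force flag, a bucket that is not yet logically ready is
      // rejected (the "return false" exit points mentioned in the review).
      // With the force flag, those checks are skipped so a leader bucket
      // whose shadow bucket failed to initialize can still be cleaned up.
      static boolean canRemove(BucketState state, boolean forceToRemove) {
        if (forceToRemove) {
          return true;
        }
        return state == BucketState.READY;
      }
    }
    ```

    Under this sketch, normal callers keep passing false and see no behavior change; only the new leader-bucket cleanup path passes true.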


> data mismatch caused by rebalance. waitUntilFlushed returns false
> -----------------------------------------------------------------
>
>                 Key: GEODE-3055
>                 URL: https://issues.apache.org/jira/browse/GEODE-3055
>             Project: Geode
>          Issue Type: Bug
>            Reporter: xiaojian zhou
>            Assignee: xiaojian zhou
>              Labels: lucene
>
> /export/buglogs_bvt/xzhou/lucene/concParRegHAPersist-0601-171739
> lucene/concParRegHAPersist.conf
> A=accessor
> B=dataStore
> accessorHosts=1
> accessorThreadsPerVM=5
> accessorVMsPerHost=1
> dataStoreHosts=6
> dataStoreThreadsPerVM=5
> dataStoreVMsPerHost=1
> numVMsToStop=2
> redundantCopies=0
> no local.conf
> In dataStoregemfire5_7483/system.log, thread tid=0xdf, putAll Object_11066
> 17:22:27.135 tid=0xdf] generated tag {v1; rv13 shadowKey=2939
> 17:22:27.136 _partitionedRegionPARALLELGATEWAYSENDER_QUEUE_1 bucket : null // brq is not ready yet
> is enqueued to the tempQueue
> 17:22:27.272 tid=0xdf] generated tag {v3; rv15 shadowKey=3278
> 17:22:33.111 Subregion created: /_PR/_BAsyncEventQueueindex#partitionedRegionPARALLELGATEWAYSENDER_QUEUE_1
> vm_3_dataStore3_r02-s28_28143.log:
> 17:22:33.120 Put successfully in the queue shadowKey= 2939
> 17:22:33.156 tid=0x7fe started query
> 17:22:33.176 Peeked shadowKey= 2939
> So the root cause is: the query ran while the event was still sitting in the tempQueue, before it had been processed. waitUntilFlushed should wait until the tempQueue is also flushed.
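The race described in the log excerpt can be sketched with two plain queues standing in for the shadow bucket queue and the tempQueue that buffers events while the bucket region queue (brq) is still initializing. This is a toy model only; FlushSketch and its method names are invented here and do not reflect Geode's actual AsyncEventQueue internals.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the flush check: "queue" stands in for the shadow bucket
// queue, "tempQueue" for the buffer holding events enqueued before the
// bucket region queue was ready (e.g. shadowKey=2939 in the log above).
class FlushSketch {
  final Queue<String> queue = new ArrayDeque<>();
  final Queue<String> tempQueue = new ArrayDeque<>();

  // The buggy behavior: checking only the main queue reports "flushed"
  // even though unprocessed events still sit in the tempQueue, so a
  // query can run against incomplete data.
  boolean flushedIgnoringTempQueue() {
    return queue.isEmpty();
  }

  // The proposed fix: count as flushed only once the tempQueue has also
  // been drained and its events processed.
  boolean flushed() {
    return queue.isEmpty() && tempQueue.isEmpty();
  }
}
```

With an event parked in the tempQueue, flushedIgnoringTempQueue() returns true while flushed() correctly returns false, which mirrors the data mismatch seen in the test run.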



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)