Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/11/16 08:18:06 UTC

[GitHub] [incubator-druid] ArvinZheng opened a new issue #8882: equals method in SegmentId looks buggy

URL: https://github.com/apache/incubator-druid/issues/8882
 
 
    The Coordinator fails to distribute segments, and we suspect that the `equals` method in `SegmentId` causes this issue.
   ```
     private int computeHashCode()
     {
       // Start with partitionNum and version hash codes, because they are often little sequential numbers. If they are
       // added in the end of the chain, resulting hashCode of SegmentId could have worse distribution.
       int hashCode = partitionNum;
       // 1000003 is a constant used in Google AutoValue, provides a little better distribution than 31
       hashCode = hashCode * 1000003 + version.hashCode();
   
       hashCode = hashCode * 1000003 + dataSource.hashCode();
       hashCode = hashCode * 1000003 + Long.hashCode(intervalStartMillis);
       hashCode = hashCode * 1000003 + Long.hashCode(intervalEndMillis);
       hashCode = hashCode * 1000003 + Objects.hashCode(intervalChronology);
       return hashCode;
     }
   
     @Override
     public boolean equals(Object o)
     {
       if (this == o) {
         return true;
       }
       if (!(o instanceof SegmentId)) {
         return false;
       }
       SegmentId that = (SegmentId) o;
       // Compare hashCode instead of partitionNum: break the chain quicker if the objects are not equal. If the hashCodes
       // are equal as well as all other fields used to compute them, the partitionNums are also guaranteed to be equal.
       return hashCode == that.hashCode &&
              dataSource.equals(that.dataSource) &&
              intervalStartMillis == that.intervalStartMillis &&
              intervalEndMillis == that.intervalEndMillis &&
              Objects.equals(intervalChronology, that.intervalChronology) &&
              version.equals(that.version);
     }
   ```
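
    To make the collision question concrete, here is a minimal standalone sketch (not Druid code; the non-partitionNum field hashes are invented constants) that replays the `computeHashCode()` chain above for ids that differ only in `partitionNum`:
    ```
    // Standalone sketch, not Druid code: replays the computeHashCode() chain with
    // every non-partitionNum field hash fixed to an invented constant, so only
    // partitionNum varies between the ids being compared.
    public class SegmentIdHashSketch
    {
      // Invented stand-ins for the version/dataSource/interval/chronology hashes.
      private static final int[] OTHER_FIELD_HASHES = {17, 42, 255, 1024, 9001};

      static int computeHashCode(int partitionNum)
      {
        int hashCode = partitionNum;
        for (int fieldHash : OTHER_FIELD_HASHES) {
          hashCode = hashCode * 1000003 + fieldHash;
        }
        return hashCode;
      }

      public static void main(String[] args)
      {
        // Sparse partition numbers like the ones our indexer generates.
        int[] partitionNums = {10112217, 100412240, 110112589, 20110920};
        for (int i = 0; i < partitionNums.length; i++) {
          for (int j = i + 1; j < partitionNums.length; j++) {
            // partitionNum is effectively multiplied by 1000003^5; that constant
            // is odd, hence invertible mod 2^32, so ids differing only in
            // partitionNum should never collide here.
            System.out.printf(
                "%d vs %d -> collision: %b%n",
                partitionNums[i],
                partitionNums[j],
                computeHashCode(partitionNums[i]) == computeHashCode(partitionNums[j])
            );
          }
        }
      }
    }
    ```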
   
   ### Affected Version
   0.16.0
   
   ### Description
   
    We got an `exceptionMessage=Can't add chunk[org.apache.druid.timeline.partition.LinearPartitionChunk@1cb644f] to a full atomicUpdateGroup[AtomicUpdateGroup{chunks=[org.apache.druid.timeline.partition.LinearPartitionChunk@9a644f]}]}` error from the **Coordinator** while upgrading our cluster from 0.13.0 to 0.16.0; see the **Stack trace** section below for details.
    To give more context: we use our own indexer, which reads data from Kafka and builds segments. We create our segments with `LinearShardSpec` and maintain internal logic that generates the partition number based on our data distribution strategy. The generated partition numbers per interval are far from consecutive; for example, the following partition numbers were all generated for the same interval, and we may produce 2000 to 4000 segments in production:
   `10112217, 100412240, 110112589, 120212564, 20110920, 30212707, 40112686, 50312650, 60212319, 70412079, 80212020, 90312664`
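
    For illustration, this is roughly how those sparse numbers map onto shard specs (a sketch assuming druid-core on the classpath; the `SparsePartitionNumSketch` class is invented for this demo):
    ```
    import org.apache.druid.timeline.partition.LinearShardSpec;

    // Invented demo class: each custom-generated partition number becomes its own
    // LinearShardSpec, so partition numbers within one interval are sparse rather
    // than consecutive.
    public class SparsePartitionNumSketch
    {
      public static void main(String[] args)
      {
        int[] generatedPartitionNums = {10112217, 100412240, 110112589, 120212564};
        for (int partitionNum : generatedPartitionNums) {
          LinearShardSpec shardSpec = new LinearShardSpec(partitionNum);
          System.out.println("partitionNum = " + shardSpec.getPartitionNum());
        }
      }
    }
    ```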
   
   - Cluster size
    We found this on our dev cluster, which has only 3 historical nodes, 1 query broker, and 1 coordinator.
   
   - Steps to reproduce the problem
    We are not able to reproduce the problem: we found this issue while upgrading our cluster to 0.16.0, and the affected segments have since been cleaned up.
   
   - Any debugging that you have already done
    1. We went through the `VersionedIntervalTimeline`, `OvershadowableManager`, `AtomicUpdateGroup`, and `RootPartitionRange` classes. For the same interval and version, we should never have multiple segments with `LinearShardSpec` belonging to the same `AtomicUpdateGroup`: the `RootPartitionRange` for `LinearShardSpec` is always (partitionNum, partitionNum + 1), and we can never have the same partition number twice within one interval (guaranteed by a DB table constraint).
    2. In the `doPoll()` method of `SQLMetadataSegmentManager`, every time we query the DB, we try to replace each segment with an existing one from the previous snapshot if we can find it by `SegmentId`. If `computeHashCode()` in `SegmentId` can ever collide for two different partition numbers while all other fields are equal, our segment poll is broken.
    3. Unfortunately we couldn't reproduce the issue, but I still don't feel comfortable using a hash in the `equals` method of a Java object, especially when it is used for key lookups. In my opinion, comparing `partitionNum` there would be better (see the sketch below); weighing correctness against performance, I am not sure how much performance comparing the hash actually buys.
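
    For concreteness, the change suggested in item 3 would look roughly like this (an untested sketch against the 0.16.0 code quoted above, not a proposed patch):
    ```
    @Override
    public boolean equals(Object o)
    {
      if (this == o) {
        return true;
      }
      if (!(o instanceof SegmentId)) {
        return false;
      }
      SegmentId that = (SegmentId) o;
      // Compare partitionNum directly rather than the cached hashCode, so that
      // equality no longer depends on any property of the hash function.
      return partitionNum == that.partitionNum &&
             dataSource.equals(that.dataSource) &&
             intervalStartMillis == that.intervalStartMillis &&
             intervalEndMillis == that.intervalEndMillis &&
             Objects.equals(intervalChronology, that.intervalChronology) &&
             version.equals(that.version);
    }
    ```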
   
   - Stack trace
   ```
   2019-11-13T01:43:23,526 ERROR [org.apache.druid.metadata.SQLMetadataSegmentManager-Exec--0] org.apache.druid.metadata.SQLMetadataSegmentManager - Uncaught exception in class org.apache.druid.metadata.SQLMetadataSegmentManager's polling thread: {class=org.apache.druid.metadata.SQLMetadataSegmentManager, exceptionType=class org.apache.druid.java.util.common.ISE, exceptionMessage=Can't add chunk[org.apache.druid.timeline.partition.LinearPartitionChunk@1cb644f] to a full atomicUpdateGroup[AtomicUpdateGroup{chunks=[org.apache.druid.timeline.partition.LinearPartitionChunk@9a644f]}]}
   org.apache.druid.java.util.common.ISE: Can't add chunk[org.apache.druid.timeline.partition.LinearPartitionChunk@1cb644f] to a full atomicUpdateGroup[AtomicUpdateGroup{chunks=[org.apache.druid.timeline.partition.LinearPartitionChunk@9a644f]}]
       at org.apache.druid.timeline.partition.OvershadowableManager.addChunk(OvershadowableManager.java:656) ~[druid-core-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.timeline.partition.PartitionHolder.add(PartitionHolder.java:61) ~[druid-core-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.timeline.VersionedIntervalTimeline.addAll(VersionedIntervalTimeline.java:188) ~[druid-core-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.timeline.VersionedIntervalTimeline.addSegments(VersionedIntervalTimeline.java:116) ~[druid-core-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.timeline.VersionedIntervalTimeline.forSegments(VersionedIntervalTimeline.java:107) ~[druid-core-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.timeline.VersionedIntervalTimeline.forSegments(VersionedIntervalTimeline.java:100) ~[druid-core-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.client.DataSourcesSnapshot.lambda$new$3(DataSourcesSnapshot.java:84) ~[druid-server-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.utils.CollectionUtils.lambda$mapValues$0(CollectionUtils.java:96) ~[druid-core-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at java.util.HashMap.forEach(HashMap.java:1289) ~[?:1.8.0_181]
       at org.apache.druid.utils.CollectionUtils.mapValues(CollectionUtils.java:96) ~[druid-core-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.client.DataSourcesSnapshot.<init>(DataSourcesSnapshot.java:82) ~[druid-server-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.client.DataSourcesSnapshot.fromUsedSegments(DataSourcesSnapshot.java:54) ~[druid-server-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.metadata.SQLMetadataSegmentManager.doPoll(SQLMetadataSegmentManager.java:973) ~[druid-server-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.metadata.SQLMetadataSegmentManager.poll(SQLMetadataSegmentManager.java:905) ~[druid-server-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at org.apache.druid.metadata.SQLMetadataSegmentManager.lambda$createPollTaskForStartOrder$0(SQLMetadataSegmentManager.java:321) ~[druid-server-0.16.0-incubating-iap3.jar:0.16.0-incubating-iap3]
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
       at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_181]
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_181]
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_181]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
   ```
   cc: @michaelschiff 
