You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/07/15 22:55:16 UTC

[GitHub] [pinot] deemoliu opened a new pull request, #9062: [WIP] [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

deemoliu opened a new pull request, #9062:
URL: https://github.com/apache/pinot/pull/9062

   Instructions:
   1. The PR has to be tagged with at least one of the following labels (*):
      1. `feature`
      2. `bugfix`
      3. `performance`
      4. `ui`
      5. `backward-incompat`
      6. `release-notes` (**)
   2. Remove these instructions before publishing the PR.
    
   (*) Other labels to consider:
   - `testing`
   - `dependencies`
   - `docker`
   - `kubernetes`
   - `observability`
   - `security`
   - `code-style`
   - `extension-point`
   - `refactor`
   - `cleanup`
   
   (**) Use `release-notes` label for scenarios like:
   - New configuration options
   - Deprecation of configurations
   - Signature changes to public methods/interfaces
   - New plugins added or old plugins removed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r935859101


##########
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/UpsertConfig.java:
##########
@@ -55,12 +55,16 @@ public enum Strategy {
   @JsonPropertyDescription("Column for upsert comparison, default to time column")
   private final String _comparisonColumn;
 
+  @JsonPropertyDescription("whether use snapshot for fast upsert metadata recovery")
+  private boolean _useSnapshot;
+
   @JsonCreator
   public UpsertConfig(@JsonProperty(value = "mode", required = true) Mode mode,

Review Comment:
   thanks @Jackie-Jiang. Will do it and rebase. 
   
   one quick question, will this affect backward compatibility?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r928003568


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java:
##########
@@ -198,6 +201,13 @@ public void init(SegmentGeneratorConfig config, SegmentCreationDataSource dataSo
     // Create a temporary directory used in segment creation
     _tempIndexDir = new File(indexDir, "tmp-" + UUID.randomUUID());
     LOGGER.debug("tempIndexDir:{}", _tempIndexDir);
+
+    _upsertSnapshotEnabled = _config.isUpsertSnapshotEnabled();

Review Comment:
   Thanks @Jackie-Jiang i planned to create another PR for TTL-related config and logics. Do you think I should add that in the PR?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on PR #9062:
URL: https://github.com/apache/pinot/pull/9062#issuecomment-1190534594

   > Is there a reason we can't enable this by default and need a config?
   
   Hi @KKcorps thanks for reviewing. I think `snapshotEnabled` can be used as a config field similiar to "nullHandlingEnabled".   It will not affect the backward compatibility.
   Also i am think about refactor this config once the TTL config is added, so it will be by default enabled for all the upsert TTL use cases (since it's required).
   
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] yupeng9 commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
yupeng9 commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r929396638


##########
pinot-integration-test-base/src/test/java/org/apache/pinot/integration/tests/BaseClusterIntegrationTest.java:
##########
@@ -348,7 +348,7 @@ protected Map<String, String> getStreamConfigMap() {
       streamConfigMap.put(KafkaStreamConfigProperties.constructStreamProperty(
           KafkaStreamConfigProperties.HighLevelConsumer.KAFKA_HLC_ZK_CONNECTION_STRING), getKafkaZKAddress());
       streamConfigMap.put(KafkaStreamConfigProperties.constructStreamProperty(
-              KafkaStreamConfigProperties.HighLevelConsumer.KAFKA_HLC_BOOTSTRAP_SERVER),
+          KafkaStreamConfigProperties.HighLevelConsumer.KAFKA_HLC_BOOTSTRAP_SERVER),

Review Comment:
   curious, do you use the right formatter?



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java:
##########
@@ -213,6 +214,13 @@ public List<StarTreeV2> getStarTrees() {
     return _starTreeIndexContainer != null ? _starTreeIndexContainer.getStarTrees() : null;
   }
 
+  @Nullable
+  public void setValidDocIds(ImmutableRoaringBitmap validDocIds) {
+    for (int docId: validDocIds.toArray()) {
+      _validDocIds.add(docId);

Review Comment:
   this is not set, but add?



##########
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/OfflineClusterIntegrationTest.java:
##########
@@ -142,7 +142,7 @@ public class OfflineClusterIntegrationTest extends BaseClusterIntegrationTestSet
   private static final String MAX_NUM_MULTI_VALUES_MAP_KEY = "maxNumMultiValuesMap";
   // TODO: This might lead to flaky test, as this disk size is not deterministic
   //       as it depends on the iteration order of a HashSet.
-  private static final int DISK_SIZE_IN_BYTES = 20796000;
+  private static final int DISK_SIZE_IN_BYTES = 20796348;

Review Comment:
   why change on this?



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java:
##########
@@ -198,6 +201,13 @@ public void init(SegmentGeneratorConfig config, SegmentCreationDataSource dataSo
     // Create a temporary directory used in segment creation
     _tempIndexDir = new File(indexDir, "tmp-" + UUID.randomUUID());
     LOGGER.debug("tempIndexDir:{}", _tempIndexDir);
+
+    _upsertSnapshotEnabled = _config.isUpsertSnapshotEnabled();

Review Comment:
   Should we do this in `build` method? I think this shall be enabled with TTL, as it applies only to the segments outside TTL, otherwise the valid doc id will keep changing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r935906982


##########
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/UpsertConfig.java:
##########
@@ -55,12 +55,16 @@ public enum Strategy {
   @JsonPropertyDescription("Column for upsert comparison, default to time column")
   private final String _comparisonColumn;
 
+  @JsonPropertyDescription("whether use snapshot for fast upsert metadata recovery")
+  private boolean _useSnapshot;
+
   @JsonCreator
   public UpsertConfig(@JsonProperty(value = "mode", required = true) Mode mode,

Review Comment:
   We may deprecate the current constructor and add an empty constructor to keep backward compatibility



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on pull request #9062: [WIP] [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on PR #9062:
URL: https://github.com/apache/pinot/pull/9062#issuecomment-1186030330

   Will fix the checkstyle, test failures, and add unit tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r929421458


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java:
##########
@@ -213,6 +214,13 @@ public List<StarTreeV2> getStarTrees() {
     return _starTreeIndexContainer != null ? _starTreeIndexContainer.getStarTrees() : null;
   }
 
+  @Nullable
+  public void setValidDocIds(ImmutableRoaringBitmap validDocIds) {
+    for (int docId: validDocIds.toArray()) {
+      _validDocIds.add(docId);

Review Comment:
   Thanks @yupeng9 
   
   Previously i tried to set the validDocIndex directly from snapshot. so i iterate validDocId from immutableRoaringBitmap to ThreadSafeRoaringBitmap. I synced with Jackie and learned this is not correct, so I fix it in `[this commit](https://github.com/apache/pinot/pull/9062/commits/56fb2a6b629473f5a64eb945597092854a39c3ba)`
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] KKcorps commented on pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
KKcorps commented on PR #9062:
URL: https://github.com/apache/pinot/pull/9062#issuecomment-1190024447

   Is there a reason we can't enabled this by default and need a config? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] KKcorps commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
KKcorps commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r925370537


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java:
##########
@@ -215,6 +216,13 @@ public List<StarTreeV2> getStarTrees() {
     return _starTreeIndexContainer != null ? _starTreeIndexContainer.getStarTrees() : null;
   }
 
+  @Nullable
+  public void setValidDocIds(ImmutableRoaringBitmap validDocIds) {
+    for (int docId: validDocIds.toArray()) {
+      _validDocIds.add(docId);

Review Comment:
   We can use `MutableRoaringBitmap.add(validDocIds, start, end)` operation here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on PR #9062:
URL: https://github.com/apache/pinot/pull/9062#issuecomment-1196014305

   @yupeng9 @Jackie-Jiang Do you have any suggestion on when to persist snapshot? I looked into partitionUpsertMetaDataManager and summarized the [scenarios for persisting snapshot](https://docs.google.com/document/d/1guYCag3VoE2148oAj8v1k714qmuhrKI1Z1HNJt2BLqQ/edit#)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on PR #9062:
URL: https://github.com/apache/pinot/pull/9062#issuecomment-1194735021

   @Jackie-Jiang pointed that there is a potential issue cause inconsistent query result.
   #9095 #9101
   
   when reloading the segments, the invalidDocsIndexes can be empty (drop and recreate) for a short period of time. If the persisting snapshot triggered during that time, we will persist empty snapshot and get wrong data. will need to rebase the change and then walk through corner cases when persist the snapshot.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r935855593


##########
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/metadata/SegmentMetadataImpl.java:
##########
@@ -403,6 +408,10 @@ public Map<String, ColumnMetadata> getColumnMetadataMap() {
     return _columnMetadataMap;
   }
 
+  public boolean isUpsertSnapshotEnabled() {

Review Comment:
   Gotcha, thanks for the suggestion! let me address this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #9062: [WIP] [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #9062:
URL: https://github.com/apache/pinot/pull/9062#issuecomment-1188362449

   # [Codecov](https://codecov.io/gh/apache/pinot/pull/9062?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#9062](https://codecov.io/gh/apache/pinot/pull/9062?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8e7f2a9) into [master](https://codecov.io/gh/apache/pinot/commit/83410f69b49988f4a1c1f0f373c8bafcd25404ba?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (83410f6) will **decrease** coverage by `54.65%`.
   > The diff coverage is `0.00%`.
   
   > :exclamation: Current head 8e7f2a9 differs from pull request most recent head 6a3697b. Consider uploading reports for the commit 6a3697b to get more accurate results
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #9062       +/-   ##
   =============================================
   - Coverage     70.09%   15.43%   -54.66%     
   + Complexity     4741      170     -4571     
   =============================================
     Files          1831     1785       -46     
     Lines         96428    94380     -2048     
     Branches      14413    14178      -235     
   =============================================
   - Hits          67591    14571    -53020     
   - Misses        24169    78765    +54596     
   + Partials       4668     1044     -3624     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `?` | |
   | unittests2 | `15.43% <0.00%> (+0.02%)` | :arrow_up: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/9062?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...ata/manager/realtime/RealtimeTableDataManager.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3QtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3QvY29yZS9kYXRhL21hbmFnZXIvcmVhbHRpbWUvUmVhbHRpbWVUYWJsZURhdGFNYW5hZ2VyLmphdmE=) | `0.00% <0.00%> (-69.82%)` | :arrow_down: |
   | [...l/indexsegment/immutable/ImmutableSegmentImpl.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9pbmRleHNlZ21lbnQvaW1tdXRhYmxlL0ltbXV0YWJsZVNlZ21lbnRJbXBsLmphdmE=) | `0.00% <0.00%> (-57.32%)` | :arrow_down: |
   | [...indexsegment/immutable/ImmutableSegmentLoader.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9pbmRleHNlZ21lbnQvaW1tdXRhYmxlL0ltbXV0YWJsZVNlZ21lbnRMb2FkZXIuamF2YQ==) | `0.00% <0.00%> (-85.46%)` | :arrow_down: |
   | [...ment/creator/impl/SegmentColumnarIndexCreator.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9TZWdtZW50Q29sdW1uYXJJbmRleENyZWF0b3IuamF2YQ==) | `0.00% <0.00%> (-78.86%)` | :arrow_down: |
   | [...t/creator/impl/SegmentIndexCreationDriverImpl.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9TZWdtZW50SW5kZXhDcmVhdGlvbkRyaXZlckltcGwuamF2YQ==) | `0.00% <0.00%> (-83.73%)` | :arrow_down: |
   | [.../creator/impl/upsert/ValidDocsSnapshotCreator.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC91cHNlcnQvVmFsaWREb2NzU25hcHNob3RDcmVhdG9yLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...l/upsert/validdocs/ValidDocsSnapshotContainer.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91cHNlcnQvdmFsaWRkb2NzL1ZhbGlkRG9jc1NuYXBzaG90Q29udGFpbmVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...java/org/apache/pinot/segment/spi/V1Constants.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1zcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3Bpbm90L3NlZ21lbnQvc3BpL1YxQ29uc3RhbnRzLmphdmE=) | `0.00% <ø> (-12.50%)` | :arrow_down: |
   | [...ot/segment/spi/creator/SegmentGeneratorConfig.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1zcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3Bpbm90L3NlZ21lbnQvc3BpL2NyZWF0b3IvU2VnbWVudEdlbmVyYXRvckNvbmZpZy5qYXZh) | `0.00% <0.00%> (-82.50%)` | :arrow_down: |
   | [...egment/spi/index/metadata/SegmentMetadataImpl.java](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1zcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3Bpbm90L3NlZ21lbnQvc3BpL2luZGV4L21ldGFkYXRhL1NlZ21lbnRNZXRhZGF0YUltcGwuamF2YQ==) | `0.00% <0.00%> (-73.37%)` | :arrow_down: |
   | ... and [1396 more](https://codecov.io/gh/apache/pinot/pull/9062/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/9062?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/9062?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [a13dc03...6a3697b](https://codecov.io/gh/apache/pinot/pull/9062?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r929417448


##########
pinot-integration-test-base/src/test/java/org/apache/pinot/integration/tests/BaseClusterIntegrationTest.java:
##########
@@ -348,7 +348,7 @@ protected Map<String, String> getStreamConfigMap() {
       streamConfigMap.put(KafkaStreamConfigProperties.constructStreamProperty(
           KafkaStreamConfigProperties.HighLevelConsumer.KAFKA_HLC_ZK_CONNECTION_STRING), getKafkaZKAddress());
       streamConfigMap.put(KafkaStreamConfigProperties.constructStreamProperty(
-              KafkaStreamConfigProperties.HighLevelConsumer.KAFKA_HLC_BOOTSTRAP_SERVER),
+          KafkaStreamConfigProperties.HighLevelConsumer.KAFKA_HLC_BOOTSTRAP_SERVER),

Review Comment:
   @yupeng9 I re-imported but the codestyle seems has slight difference between each intellij versions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r928003568


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java:
##########
@@ -198,6 +201,13 @@ public void init(SegmentGeneratorConfig config, SegmentCreationDataSource dataSo
     // Create a temporary directory used in segment creation
     _tempIndexDir = new File(indexDir, "tmp-" + UUID.randomUUID());
     LOGGER.debug("tempIndexDir:{}", _tempIndexDir);
+
+    _upsertSnapshotEnabled = _config.isUpsertSnapshotEnabled();

Review Comment:
   Thanks @Jackie-Jiang i planned to create another PR for TTL-related config and logics. Do you think I should add that in the same PR?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r935859820


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java:
##########
@@ -213,6 +215,16 @@ public List<StarTreeV2> getStarTrees() {
     return _starTreeIndexContainer != null ? _starTreeIndexContainer.getStarTrees() : null;
   }
 
+  @Nullable
+  public void setValidDocSnapshots(ImmutableRoaringBitmap validDocSnapshots) {

Review Comment:
   gotcha, thanks for pointing out, let me address this.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java:
##########
@@ -213,6 +215,16 @@ public List<StarTreeV2> getStarTrees() {
     return _starTreeIndexContainer != null ? _starTreeIndexContainer.getStarTrees() : null;
   }
 
+  @Nullable
+  public void setValidDocSnapshots(ImmutableRoaringBitmap validDocSnapshots) {
+    _validDocSnapshots = validDocSnapshots;
+  }
+
+  @Nullable
+  public ImmutableRoaringBitmap getValidDocSnapshots() {

Review Comment:
   gotcha, thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r925983121


##########
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java:
##########
@@ -183,8 +183,8 @@ protected void doInit() {
       Preconditions.checkState(schema != null, "Failed to find schema for table: %s", _tableNameWithType);
 
       _primaryKeyColumns = schema.getPrimaryKeyColumns();
-      Preconditions.checkState(!CollectionUtils.isEmpty(_primaryKeyColumns),
-          "Primary key columns must be configured for dedup");
+      Preconditions

Review Comment:
   Can you please re-apply the [Pinot Style](https://docs.pinot.apache.org/developers/developers-and-contributors/code-setup#setup-ide)?
   I don't know why you get different auto-formatting results, but these changes are not related to the PR, and make it quite hard to review. Let's also revert the unrelated auto-formatting changes in other classes



##########
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java:
##########
@@ -416,15 +415,18 @@ private void buildDedupMeta(ImmutableSegmentImpl immutableSegment) {
 
   private void handleUpsert(ImmutableSegmentImpl immutableSegment) {
     String segmentName = immutableSegment.getSegmentName();
-    Integer partitionGroupId =
-        SegmentUtils.getRealtimeSegmentPartitionId(segmentName, _tableNameWithType, _helixManager,
-            _primaryKeyColumns.get(0));
-    Preconditions.checkNotNull(partitionGroupId,
-        String.format("PartitionGroupId is not available for segment: '%s' (upsert-enabled table: %s)", segmentName,
+    Integer partitionGroupId = SegmentUtils
+        .getRealtimeSegmentPartitionId(segmentName, _tableNameWithType, _helixManager, _primaryKeyColumns.get(0));
+    Preconditions.checkNotNull(partitionGroupId, String
+        .format("PartitionGroupId is not available for segment: '%s' (upsert-enabled table: %s)", segmentName,
             _tableNameWithType));
     PartitionUpsertMetadataManager partitionUpsertMetadataManager =
         _tableUpsertMetadataManager.getOrCreatePartitionManager(partitionGroupId);
+
     ThreadSafeMutableRoaringBitmap validDocIds = new ThreadSafeMutableRoaringBitmap();
+    if (immutableSegment.getValidDocIds() != null) {

Review Comment:
   I don't think this is the correct way to handle it. We should get the persisted valid doc id, and iterate over it to get the `recordInfoIterator`



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java:
##########
@@ -198,6 +201,13 @@ public void init(SegmentGeneratorConfig config, SegmentCreationDataSource dataSo
     // Create a temporary directory used in segment creation
     _tempIndexDir = new File(indexDir, "tmp-" + UUID.randomUUID());
     LOGGER.debug("tempIndexDir:{}", _tempIndexDir);
+
+    _upsertSnapshotEnabled = _config.isUpsertSnapshotEnabled();

Review Comment:
   We don't have the valid doc id info during segment creation. The snapshot has to be taken periodically during consumption. In the current implementation, no valid doc is added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r929419135


##########
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/OfflineClusterIntegrationTest.java:
##########
@@ -142,7 +142,7 @@ public class OfflineClusterIntegrationTest extends BaseClusterIntegrationTestSet
   private static final String MAX_NUM_MULTI_VALUES_MAP_KEY = "maxNumMultiValuesMap";
   // TODO: This might lead to flaky test, as this disk size is not deterministic
   //       as it depends on the iteration order of a HashSet.
-  private static final int DISK_SIZE_IN_BYTES = 20796000;
+  private static final int DISK_SIZE_IN_BYTES = 20796348;

Review Comment:
   thanks @yupeng9 for review. 
   I added "segment.use.ttl" properties in the metadata.
   When load from disk, I use this properties from `segmentMetadataPropertiesConfiguration` to initialize [validDocSnapshot](https://github.com/apache/pinot/pull/9062/commits/6bfab718afadfaa7139de227dd27edf7d0a44b9e)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r928002601


##########
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java:
##########
@@ -416,15 +415,18 @@ private void buildDedupMeta(ImmutableSegmentImpl immutableSegment) {
 
   private void handleUpsert(ImmutableSegmentImpl immutableSegment) {
     String segmentName = immutableSegment.getSegmentName();
-    Integer partitionGroupId =
-        SegmentUtils.getRealtimeSegmentPartitionId(segmentName, _tableNameWithType, _helixManager,
-            _primaryKeyColumns.get(0));
-    Preconditions.checkNotNull(partitionGroupId,
-        String.format("PartitionGroupId is not available for segment: '%s' (upsert-enabled table: %s)", segmentName,
+    Integer partitionGroupId = SegmentUtils
+        .getRealtimeSegmentPartitionId(segmentName, _tableNameWithType, _helixManager, _primaryKeyColumns.get(0));
+    Preconditions.checkNotNull(partitionGroupId, String
+        .format("PartitionGroupId is not available for segment: '%s' (upsert-enabled table: %s)", segmentName,
             _tableNameWithType));
     PartitionUpsertMetadataManager partitionUpsertMetadataManager =
         _tableUpsertMetadataManager.getOrCreatePartitionManager(partitionGroupId);
+
     ThreadSafeMutableRoaringBitmap validDocIds = new ThreadSafeMutableRoaringBitmap();
+    if (immutableSegment.getValidDocIds() != null) {

Review Comment:
   Gotcha, thanks for review.
   For immutableSegmentImpl, it already has `validDocsIds` variable for keeping in memory states, I was think of loading the validDocsSnapshots from disk to `validDocsIds` variable directly. Is there any concern/risk on this?
   I can also add another variable call validDocsSnapshots that loaded from disk. Iterate over validDocsSnapshots to get the recordInfoInteractor and re-generate validDocIds in PartitionUpsertMetadataManager (addSegment).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r929428348


##########
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java:
##########
@@ -416,15 +415,18 @@ private void buildDedupMeta(ImmutableSegmentImpl immutableSegment) {
 
   private void handleUpsert(ImmutableSegmentImpl immutableSegment) {
     String segmentName = immutableSegment.getSegmentName();
-    Integer partitionGroupId =
-        SegmentUtils.getRealtimeSegmentPartitionId(segmentName, _tableNameWithType, _helixManager,
-            _primaryKeyColumns.get(0));
-    Preconditions.checkNotNull(partitionGroupId,
-        String.format("PartitionGroupId is not available for segment: '%s' (upsert-enabled table: %s)", segmentName,
+    Integer partitionGroupId = SegmentUtils
+        .getRealtimeSegmentPartitionId(segmentName, _tableNameWithType, _helixManager, _primaryKeyColumns.get(0));
+    Preconditions.checkNotNull(partitionGroupId, String
+        .format("PartitionGroupId is not available for segment: '%s' (upsert-enabled table: %s)", segmentName,
             _tableNameWithType));
     PartitionUpsertMetadataManager partitionUpsertMetadataManager =
         _tableUpsertMetadataManager.getOrCreatePartitionManager(partitionGroupId);
+
     ThreadSafeMutableRoaringBitmap validDocIds = new ThreadSafeMutableRoaringBitmap();
+    if (immutableSegment.getValidDocIds() != null) {

Review Comment:
   fixed [here](https://github.com/apache/pinot/pull/9062/commits/56fb2a6b629473f5a64eb945597092854a39c3ba) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r934993468


##########
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/UpsertConfig.java:
##########
@@ -55,12 +55,16 @@ public enum Strategy {
   @JsonPropertyDescription("Column for upsert comparison, default to time column")
   private final String _comparisonColumn;
 
+  @JsonPropertyDescription("whether use snapshot for fast upsert metadata recovery")
+  private boolean _useSnapshot;
+
   @JsonCreator
   public UpsertConfig(@JsonProperty(value = "mode", required = true) Mode mode,
       @JsonProperty("partialUpsertStrategies") @Nullable Map<String, Strategy> partialUpsertStrategies,
       @JsonProperty("defaultPartialUpsertStrategy") @Nullable Strategy defaultPartialUpsertStrategy,
       @JsonProperty("comparisonColumn") @Nullable String comparisonColumn,
-      @JsonProperty("hashFunction") @Nullable HashFunction hashFunction) {
+      @JsonProperty("hashFunction") @Nullable HashFunction hashFunction,
+      @JsonProperty("useSnapshot") @Nullable boolean useSnapshot) {

Review Comment:
   `enableSnapshot` might be better



##########
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/UpsertConfig.java:
##########
@@ -55,12 +55,16 @@ public enum Strategy {
   @JsonPropertyDescription("Column for upsert comparison, default to time column")
   private final String _comparisonColumn;
 
+  @JsonPropertyDescription("whether use snapshot for fast upsert metadata recovery")
+  private boolean _useSnapshot;
+
   @JsonCreator
   public UpsertConfig(@JsonProperty(value = "mode", required = true) Mode mode,

Review Comment:
   We already got too many keys in this config. Suggest making a PR similar to #8743 to clean them up first.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java:
##########
@@ -213,6 +215,16 @@ public List<StarTreeV2> getStarTrees() {
     return _starTreeIndexContainer != null ? _starTreeIndexContainer.getStarTrees() : null;
   }
 
+  @Nullable
+  public void setValidDocSnapshots(ImmutableRoaringBitmap validDocSnapshots) {

Review Comment:
   I feel a easier way to manage the snapshot is to add 3 methods into this class:
   - `@Nullable ImmutableRoaringBitmap loadValidDocIdsSnapshot()`
   - `void persistValidDocIdsSnapshot()`
   - `void deleteValidDocIdsSnapshot()`
   
   One scenario we want to handle is that when snapshot is enabled then disabled, we should be able to delete the existing snapshots.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java:
##########
@@ -213,6 +215,16 @@ public List<StarTreeV2> getStarTrees() {
     return _starTreeIndexContainer != null ? _starTreeIndexContainer.getStarTrees() : null;
   }
 
+  @Nullable
+  public void setValidDocSnapshots(ImmutableRoaringBitmap validDocSnapshots) {
+    _validDocSnapshots = validDocSnapshots;
+  }
+
+  @Nullable
+  public ImmutableRoaringBitmap getValidDocSnapshots() {

Review Comment:
   (minor) Suggest renaming it to `getValidDocIdsSnapshot()`



##########
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/metadata/SegmentMetadataImpl.java:
##########
@@ -403,6 +408,10 @@ public Map<String, ColumnMetadata> getColumnMetadataMap() {
     return _columnMetadataMap;
   }
 
+  public boolean isUpsertSnapshotEnabled() {

Review Comment:
   We should not rely on segment metadata to determine whether snapshot is enabled. The upsert config might change anytime, and we want to use/drop snapshot accordingly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r926044190


##########
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java:
##########
@@ -183,8 +183,8 @@ protected void doInit() {
       Preconditions.checkState(schema != null, "Failed to find schema for table: %s", _tableNameWithType);
 
       _primaryKeyColumns = schema.getPrimaryKeyColumns();
-      Preconditions.checkState(!CollectionUtils.isEmpty(_primaryKeyColumns),
-          "Primary key columns must be configured for dedup");
+      Preconditions

Review Comment:
   gotcha, thanks @Jackie-Jiang. let me re-import the codestyle and revert these.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] KKcorps commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
KKcorps commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r925370537


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java:
##########
@@ -215,6 +216,13 @@ public List<StarTreeV2> getStarTrees() {
     return _starTreeIndexContainer != null ? _starTreeIndexContainer.getStarTrees() : null;
   }
 
+  @Nullable
+  public void setValidDocIds(ImmutableRoaringBitmap validDocIds) {
+    for (int docId: validDocIds.toArray()) {
+      _validDocIds.add(docId);

Review Comment:
   We can use `MutableRoaringBitmap.add` operation here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r937012396


##########
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/UpsertConfig.java:
##########
@@ -55,12 +55,16 @@ public enum Strategy {
   @JsonPropertyDescription("Column for upsert comparison, default to time column")
   private final String _comparisonColumn;
 
+  @JsonPropertyDescription("whether use snapshot for fast upsert metadata recovery")
+  private boolean _useSnapshot;
+
   @JsonCreator
   public UpsertConfig(@JsonProperty(value = "mode", required = true) Mode mode,

Review Comment:
   updated in https://github.com/apache/pinot/pull/9150



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] deemoliu commented on a diff in pull request #9062: [Upsert] persist validDocsIndex snapshot for Pinot upsert optimization

Posted by GitBox <gi...@apache.org>.
deemoliu commented on code in PR #9062:
URL: https://github.com/apache/pinot/pull/9062#discussion_r929421458


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java:
##########
@@ -213,6 +214,13 @@ public List<StarTreeV2> getStarTrees() {
     return _starTreeIndexContainer != null ? _starTreeIndexContainer.getStarTrees() : null;
   }
 
+  @Nullable
+  public void setValidDocIds(ImmutableRoaringBitmap validDocIds) {
+    for (int docId: validDocIds.toArray()) {
+      _validDocIds.add(docId);

Review Comment:
   Thanks @yupeng9 
   
   Previously i tried to set the validDocIndex directly from snapshot. so i iterate validDocId from immutableRoaringBitmap to ThreadSafeRoaringBitmap. I synced with Jackie and learned this is not correct, so I fix it in [this commit](https://github.com/apache/pinot/pull/9062/commits/56fb2a6b629473f5a64eb945597092854a39c3ba)
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org