You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2022/02/14 08:58:45 UTC

[GitHub] [hive] kasakrisz opened a new pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

kasakrisz opened a new pull request #3020:
URL: https://github.com/apache/hive/pull/3020


   ### What changes were proposed in this pull request?
   HIVE-23949 introduces a caching layer between HS2 and HMS. This patch extend alter partition HMS API call with invalidating cached `Partition` entries after successful alter partition calls.
   Alter partition calls has several versions but only the one which depends on the `writeIdList` is affected.
   
   ### Why are the changes needed?
   Subsequent `getPartitionsByNames` calls return the cached `Partition` objects however alter partition may changed the underlying Partitions and the changes are not reflected in the cache objects. This can lead to data correctness issues for example when computing stats.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Incorrect results are fixed.
   
   ### How was this patch tested?
   ```
   mvn test -Dtest.output.overwrite -DskipSparkTests -Dtest=TestMiniLlapLocalCliDriver -Dqfile=stats_part_multi_insert_acid.q -pl itests/qtest -Pitests
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on pull request #3020:
URL: https://github.com/apache/hive/pull/3020#issuecomment-1038821249


   @soumyakanti3578 
   Could you please take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] soumyakanti3578 commented on a change in pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

Posted by GitBox <gi...@apache.org>.
soumyakanti3578 commented on a change in pull request #3020:
URL: https://github.com/apache/hive/pull/3020#discussion_r806091142



##########
File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientWithLocalCache.java
##########
@@ -493,6 +495,32 @@ protected GetPartitionsByNamesResult getPartitionsByNamesInternal(GetPartitionsB
     return super.getPartitionsByNamesInternal(rqst);
   }
 
+  @Override
+  public void alter_partitions(String catName, String dbName, String tblName, List<Partition> newParts,
+                               EnvironmentContext environmentContext, String writeIdList, long writeId)
+          throws TException {
+    super.alter_partitions(catName, dbName, tblName, newParts, environmentContext, writeIdList, writeId);
+
+    if (!isCacheEnabledAndInitialized() || mscLocalCache == null) {
+      return;
+    }
+
+    // invalidate cached Partition entries
+    List<String> processorCapabilitiesList = getProcessorCapabilities() == null ?
+            null : new ArrayList<>(Arrays.asList(getProcessorCapabilities()));

Review comment:
       Is it better to call `getProcessorCapabilities()` just once?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] soumyakanti3578 commented on a change in pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

Posted by GitBox <gi...@apache.org>.
soumyakanti3578 commented on a change in pull request #3020:
URL: https://github.com/apache/hive/pull/3020#discussion_r806091142



##########
File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientWithLocalCache.java
##########
@@ -493,6 +495,32 @@ protected GetPartitionsByNamesResult getPartitionsByNamesInternal(GetPartitionsB
     return super.getPartitionsByNamesInternal(rqst);
   }
 
+  @Override
+  public void alter_partitions(String catName, String dbName, String tblName, List<Partition> newParts,
+                               EnvironmentContext environmentContext, String writeIdList, long writeId)
+          throws TException {
+    super.alter_partitions(catName, dbName, tblName, newParts, environmentContext, writeIdList, writeId);
+
+    if (!isCacheEnabledAndInitialized() || mscLocalCache == null) {
+      return;
+    }
+
+    // invalidate cached Partition entries
+    List<String> processorCapabilitiesList = getProcessorCapabilities() == null ?
+            null : new ArrayList<>(Arrays.asList(getProcessorCapabilities()));

Review comment:
       Is it better to call `getProcessorCapabilities()` just once?

##########
File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientWithLocalCache.java
##########
@@ -493,6 +495,32 @@ protected GetPartitionsByNamesResult getPartitionsByNamesInternal(GetPartitionsB
     return super.getPartitionsByNamesInternal(rqst);
   }
 
+  @Override
+  public void alter_partitions(String catName, String dbName, String tblName, List<Partition> newParts,
+                               EnvironmentContext environmentContext, String writeIdList, long writeId)
+          throws TException {
+    super.alter_partitions(catName, dbName, tblName, newParts, environmentContext, writeIdList, writeId);
+
+    if (!isCacheEnabledAndInitialized() || mscLocalCache == null) {
+      return;
+    }
+
+    // invalidate cached Partition entries
+    List<String> processorCapabilitiesList = getProcessorCapabilities() == null ?
+            null : new ArrayList<>(Arrays.asList(getProcessorCapabilities()));
+
+    TableWatermark watermark = new TableWatermark(writeIdList, getTable(dbName, tblName).getId());
+    dbName =prependCatalogToDbName(catName, dbName, conf);

Review comment:
       Small change, need a space after `=`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz closed pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

Posted by GitBox <gi...@apache.org>.
kasakrisz closed pull request #3020:
URL: https://github.com/apache/hive/pull/3020


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on pull request #3020:
URL: https://github.com/apache/hive/pull/3020#issuecomment-1039069121


   @zabetak 
   I found that we should invalidate the entries with keytype `KeyType.PARTITIONS_BY_NAMES`. It seems that these entries are created only when the method `getPartitionsByNamesInternal` called.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on pull request #3020:
URL: https://github.com/apache/hive/pull/3020#issuecomment-1047442805


   Closing this. Issue resolved by #3030


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on a change in pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #3020:
URL: https://github.com/apache/hive/pull/3020#discussion_r805798350



##########
File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientWithLocalCache.java
##########
@@ -493,6 +495,32 @@ protected GetPartitionsByNamesResult getPartitionsByNamesInternal(GetPartitionsB
     return super.getPartitionsByNamesInternal(rqst);
   }
 
+  @Override
+  public void alter_partitions(String catName, String dbName, String tblName, List<Partition> newParts,
+                               EnvironmentContext environmentContext, String writeIdList, long writeId)
+          throws TException {
+    super.alter_partitions(catName, dbName, tblName, newParts, environmentContext, writeIdList, writeId);
+
+    if (!isCacheEnabledAndInitialized() || mscLocalCache == null) {
+      return;
+    }
+
+    // invalidate cached Partition entries
+    List<String> processorCapabilitiesList = getProcessorCapabilities() == null ?
+            null : new ArrayList<>(Arrays.asList(getProcessorCapabilities()));
+
+    TableWatermark watermark = new TableWatermark(writeIdList, getTable(dbName, tblName).getId());

Review comment:
       I've seen that other methods in this class are checking if the watermark is valid. Are we sure it is not necessary here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kasakrisz commented on a change in pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

Posted by GitBox <gi...@apache.org>.
kasakrisz commented on a change in pull request #3020:
URL: https://github.com/apache/hive/pull/3020#discussion_r805819187



##########
File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientWithLocalCache.java
##########
@@ -493,6 +495,32 @@ protected GetPartitionsByNamesResult getPartitionsByNamesInternal(GetPartitionsB
     return super.getPartitionsByNamesInternal(rqst);
   }
 
+  @Override
+  public void alter_partitions(String catName, String dbName, String tblName, List<Partition> newParts,
+                               EnvironmentContext environmentContext, String writeIdList, long writeId)
+          throws TException {
+    super.alter_partitions(catName, dbName, tblName, newParts, environmentContext, writeIdList, writeId);
+
+    if (!isCacheEnabledAndInitialized() || mscLocalCache == null) {
+      return;
+    }
+
+    // invalidate cached Partition entries
+    List<String> processorCapabilitiesList = getProcessorCapabilities() == null ?
+            null : new ArrayList<>(Arrays.asList(getProcessorCapabilities()));
+
+    TableWatermark watermark = new TableWatermark(writeIdList, getTable(dbName, tblName).getId());

Review comment:
       If `TableWatermark` was not valid when `getPartitionsByNamesInternal` was called then no entries were put into the cache otherwise Partition list was cached.
   When `alter_partition` called we calculate the `TableWatermark` since it is part of the cache key then invalidate the entry with that key:
   * if there were no entry nothing happens
   * if there was it is invalidated.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] soumyakanti3578 commented on a change in pull request #3020: HIVE-25918: Invalid stats after multi inserting into the same partition - ADDENDUM

Posted by GitBox <gi...@apache.org>.
soumyakanti3578 commented on a change in pull request #3020:
URL: https://github.com/apache/hive/pull/3020#discussion_r806092134



##########
File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientWithLocalCache.java
##########
@@ -493,6 +495,32 @@ protected GetPartitionsByNamesResult getPartitionsByNamesInternal(GetPartitionsB
     return super.getPartitionsByNamesInternal(rqst);
   }
 
+  @Override
+  public void alter_partitions(String catName, String dbName, String tblName, List<Partition> newParts,
+                               EnvironmentContext environmentContext, String writeIdList, long writeId)
+          throws TException {
+    super.alter_partitions(catName, dbName, tblName, newParts, environmentContext, writeIdList, writeId);
+
+    if (!isCacheEnabledAndInitialized() || mscLocalCache == null) {
+      return;
+    }
+
+    // invalidate cached Partition entries
+    List<String> processorCapabilitiesList = getProcessorCapabilities() == null ?
+            null : new ArrayList<>(Arrays.asList(getProcessorCapabilities()));
+
+    TableWatermark watermark = new TableWatermark(writeIdList, getTable(dbName, tblName).getId());
+    dbName =prependCatalogToDbName(catName, dbName, conf);

Review comment:
       Small change, need a space after `=`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org