Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/05/26 11:45:05 UTC

[GitHub] [hive] pvary commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

pvary commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r639647076



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5218,31 +5218,55 @@ private void removeUnusedColumnDescriptor(MColumnDescriptor oldCD) {
       return;
     }
 
-    boolean success = false;
     Query query = null;
+    LOG.debug("execute removeUnusedColumnDescriptor");
+    DatabaseProduct dbProduct = DatabaseProduct.determineDatabaseProduct(MetaStoreDirectSql.getProductName(pm), conf);
 
-    try {
-      openTransaction();
-      LOG.debug("execute removeUnusedColumnDescriptor");
+    /**
+     * To work around a performance issue caused by Oracle not supporting the limit statement, HIVE-9447 made
+     * every backend DB run select count(1) from SDS where SDS.CD_ID=? to check whether the specific CD_ID is
+     * still referenced in the SDS table before dropping a partition. This select count(1) statement does not
+     * scale well in Postgres, and there is no index on the CD_ID column of the SDS table.
+     * For an SDS table with 1.5 million rows, select count(1) takes 700ms on average without an index and
+     * 10-20ms with an index, while the statement used before HIVE-9447
+     * (SELECT * FROM "SDS" "A0" WHERE "A0"."CD_ID" = $1 limit 1) takes less than 10ms.
+     */
 
-      query = pm.newQuery("select count(1) from " +
-        "org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)");
+    if (dbProduct.isPOSTGRES()) {
+      query = pm.newQuery(MStorageDescriptor.class, "this.cd == inCD");
       query.declareParameters("MColumnDescriptor inCD");
-      long count = ((Long)query.execute(oldCD)).longValue();
-
+      List<MStorageDescriptor> referencedSDs = listStorageDescriptorsWithCD(oldCD, query);

Review comment:
   Is this really the fastest way to check whether `oldCD` is still referenced?
   Since Postgres supports `limit`, we might want to use it here.
   My guess is that MySQL could use `limit` as well, so we might want to test the performance on the different engines and use the appropriate query for each.
   
   What do you think?
   
   Thanks,
   Peter
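
   To make the suggestion concrete, here is a minimal sketch of choosing the existence check per engine. Everything in it is an assumption for illustration: `CdReferenceCheck`, `Engine`, `supportsLimit` and `existenceQuery` are hypothetical names, not Hive APIs, and the split between engines is a guess that would still need the performance test mentioned above. The quoted identifiers follow the statement cited in the diff comment; MySQL only accepts that quoting with ANSI_QUOTES enabled.

```java
// Hypothetical sketch, not Hive code: pick the cheapest "is this CD_ID still
// referenced?" statement for each backend.
final class CdReferenceCheck {

    // Hypothetical stand-in for the engines the metastore distinguishes.
    enum Engine { POSTGRES, MYSQL, ORACLE, OTHER }

    // Assumption: only Postgres and MySQL take the LIMIT path here. Oracle is
    // kept on count(1) because the diff comment says it lacks limit support.
    static boolean supportsLimit(Engine engine) {
        return engine == Engine.POSTGRES || engine == Engine.MYSQL;
    }

    // Returns a parameterized statement; the caller binds the CD_ID to '?'.
    static String existenceQuery(Engine engine) {
        if (supportsLimit(engine)) {
            // Lets the database stop at the first matching row.
            return "SELECT 1 FROM \"SDS\" WHERE \"CD_ID\" = ? LIMIT 1";
        }
        // Fallback: count every matching row, as introduced by HIVE-9447.
        return "SELECT COUNT(1) FROM \"SDS\" WHERE \"CD_ID\" = ?";
    }

    public static void main(String[] args) {
        for (Engine engine : Engine.values()) {
            System.out.println(engine + " -> " + existenceQuery(engine));
        }
    }
}
```

   With the LIMIT variant the column descriptor counts as referenced when the result has at least one row; with the fallback it counts as referenced when the returned count is greater than zero. At the JDO level the same effect could probably be reached with `query.setRange(0, 1)` instead of hand-written SQL, but that is untested here.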



