Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/07/08 12:59:10 UTC

[GitHub] [hive] kgyrtkirk commented on a change in pull request #1105: HIVE-22957: Support Partition Filtering In MSCK REPAIR TABLE Command

kgyrtkirk commented on a change in pull request #1105:
URL: https://github.com/apache/hive/pull/1105#discussion_r451502201



##########
File path: parser/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g
##########
@@ -734,6 +734,21 @@ dropPartitionOperator
     EQUAL | NOTEQUAL | LESSTHANOREQUALTO | LESSTHAN | GREATERTHANOREQUALTO | GREATERTHAN
     ;
 
+filterPartitionSpec
+    :
+    LPAREN filterPartitionVal (COMMA  filterPartitionVal )* RPAREN -> ^(TOK_PARTSPEC filterPartitionVal +)
+    ;
+
+filterPartitionVal
+    :
+    identifier filterPartitionOperator constant -> ^(TOK_PARTVAL identifier filterPartitionOperator constant)

Review comment:
       the old `partitionSpec` did not require the constant:
   ```
   identifier (EQUAL constant)? 
   ```
   
   were there any use cases for that form (an identifier without a constant)?

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java
##########
@@ -383,7 +375,29 @@ void findUnknownPartitions(Table table, Set<Path> partPaths,
     // now check the table folder and see if we find anything
     // that isn't in the metastore
     Set<Path> allPartDirs = new HashSet<Path>();
+    Set<Path> partDirs = new HashSet<Path>();
+    List<FieldSchema> partColumns = table.getPartitionKeys();
     checkPartitionDirs(tablePath, allPartDirs, Collections.unmodifiableList(getPartColNames(table)));
+
+    if (filterExp != null) {
+      PartitionExpressionProxy expressionProxy = createExpressionProxy(conf);
+      List<String> paritions = new ArrayList<>();
+      for (Path path : allPartDirs) {
+        // remove the table's path from the partition path
+        // eg: <tablePath>/p1=1/p2=2/p3=3 ---> p1=1/p2=2/p3=3
+        paritions.add(path.toString().substring(tablePath.toString().length() + 1));
+      }
+      // Remove all partition paths which does not matches the filter expression.
+      expressionProxy.filterPartitionsByExpr(partColumns, filterExp,
+          conf.get(MetastoreConf.ConfVars.DEFAULTPARTITIONNAME.getVarname()), paritions);
+
+      // now the partition list will contain all the paths that matches the filter expression.
+      // add them back to partDirs.
+      for (String path : paritions) {
+        partDirs.add(new Path(tablePath.toString() + "/" + path));

Review comment:
       instead of concatenating with `/`, use `new Path(parentPath, child)` - it's more portable
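
A minimal sketch of why joining through a path type is safer than string concatenation. To stay self-contained it uses `java.nio.file.Path.resolve` as a stand-in for Hadoop's `new Path(parent, child)`; the class and method names are hypothetical and the behaviour of the Hadoop class is not shown here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class JoinPartitionPath {
  // Fragile: the separator is hand-built, so a trailing "/" on tablePath
  // produces a double slash in the result.
  static String byConcatenation(String tablePath, String relative) {
    return tablePath + "/" + relative;
  }

  // Stand-in for Hadoop's new Path(parent, child): the path type owns the separator.
  static Path byResolve(Path tablePath, String relative) {
    return tablePath.resolve(relative);
  }

  public static void main(String[] args) {
    System.out.println(byConcatenation("/warehouse/t/", "p1=1/p2=2"));
    System.out.println(byResolve(Paths.get("/warehouse/t/"), "p1=1/p2=2"));
  }
}
```

With the path type, the caller never has to reason about whether the parent already ends with a separator.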

##########
File path: itests/src/test/resources/testconfiguration.properties
##########
@@ -222,6 +222,7 @@ mr.query.files=\
   mapjoin_subquery2.q,\
   mapjoin_test_outer.q,\
   masking_5.q,\
+  msck_repair_filter.q,\

Review comment:
       is there a reason that we run this test with mr?

##########
File path: parser/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g
##########
@@ -1942,9 +1942,8 @@ metastoreCheck
 @after { popMsg(state); }
     : KW_MSCK (repair=KW_REPAIR)?
       (KW_TABLE tableName
-        ((add=KW_ADD | drop=KW_DROP | sync=KW_SYNC) (parts=KW_PARTITIONS))? |
-        (partitionSpec)?)
-    -> ^(TOK_MSCK $repair? tableName? $add? $drop? $sync? (partitionSpec*)?)
+        ((add=KW_ADD | drop=KW_DROP | sync=KW_SYNC) (parts=KW_PARTITIONS) (filterPartitionSpec)?)?)
+    -> ^(TOK_MSCK $repair? tableName? $add? $drop? $sync? (filterPartitionSpec)?)

Review comment:
       I know it was here before - but let's fix this up:
   
   instead of separate add/drop/sync variables, we could have `opt=(KW_ADD|KW_DROP|KW_SYNC)`; that would make the code on the other end more readable as well

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java
##########
@@ -63,13 +67,24 @@ public void analyzeInternal(ASTNode root) throws SemanticException {
     }
 
     Table table = getTable(tableName);
-    List<Map<String, String>> specs = getPartitionSpecs(table, root);
+    Map<Integer, List<ExprNodeGenericFuncDesc>> partitionSpecs = getFullPartitionSpecs(root, table, conf, false);
+    byte[] filterExp = null;
+    if (partitionSpecs != null & !partitionSpecs.isEmpty()) {
+      // explicitly set expression proxy class to PartitionExpressionForMetastore since we intend to use the
+      // filterPartitionsByExpr of PartitionExpressionForMetastore for partition pruning down the line.
+      conf.set(MetastoreConf.ConfVars.EXPRESSION_PROXY_CLASS.getVarname(),

Review comment:
       I don't think this will work: this is the ql module, while `EXPRESSION_PROXY_CLASS` is a metastore conf key; in a remote metastore setup this `set` will probably have no effect.
   have you tried it?
   I think it would be fine to check the setting and return an error saying that this feature is not available because it requires a conf change

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java
##########
@@ -63,13 +67,24 @@ public void analyzeInternal(ASTNode root) throws SemanticException {
     }
 
     Table table = getTable(tableName);
-    List<Map<String, String>> specs = getPartitionSpecs(table, root);
+    Map<Integer, List<ExprNodeGenericFuncDesc>> partitionSpecs = getFullPartitionSpecs(root, table, conf, false);
+    byte[] filterExp = null;
+    if (partitionSpecs != null & !partitionSpecs.isEmpty()) {
+      // explicitly set expression proxy class to PartitionExpressionForMetastore since we intend to use the
+      // filterPartitionsByExpr of PartitionExpressionForMetastore for partition pruning down the line.
+      conf.set(MetastoreConf.ConfVars.EXPRESSION_PROXY_CLASS.getVarname(),
+          PartitionExpressionForMetastore.class.getCanonicalName());
+      // fetch the first value of partitionSpecs map since it will always have one key, value pair
+      filterExp = SerializationUtilities.serializeExpressionToKryo(

Review comment:
       why does this need to be flattened into a `byte[]`?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java
##########
@@ -837,6 +844,118 @@ public static void checkColumnName(String columnName) throws SemanticException {
     return colList;
   }
 
+  /**
+   * Get the partition specs from the tree. This stores the full specification
+   * with the comparator operator into the output list.
+   *
+   * @return Map of partitions by prefix length. Most of the time prefix length will
+   *         be the same for all partition specs, so we can just OR the expressions.
+   */
+  public static Map<Integer, List<ExprNodeGenericFuncDesc>> getFullPartitionSpecs(

Review comment:
       can we find a new home for these 2 `static` methods? :)
   `ql/src/java/org/apache/hadoop/hive/ql/parse/ParseUtils.java`

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java
##########
@@ -383,7 +375,29 @@ void findUnknownPartitions(Table table, Set<Path> partPaths,
     // now check the table folder and see if we find anything
     // that isn't in the metastore
     Set<Path> allPartDirs = new HashSet<Path>();
+    Set<Path> partDirs = new HashSet<Path>();

Review comment:
       move this variable inside the if

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java
##########
@@ -240,40 +243,27 @@ void checkTable(String catName, String dbName, String tableName,
     }
 
     PartitionIterable parts;
-    boolean findUnknownPartitions = true;
 
     if (isPartitioned(table)) {
-      if (partitions == null || partitions.isEmpty()) {
+      if (filterExp != null) {
+        List<Partition> results = new ArrayList<>();
+        getPartitionListByFilterExp(getMsc(), table, filterExp,

Review comment:
       I wonder if there is a more natural way to carry `filterExp` around. It will be kryo-encoded almost all the time, but it seems the metastore interface method was designed to accept kryo-encoded input.

##########
File path: ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java
##########
@@ -330,17 +330,6 @@ public void testPartitionsCheck() throws HiveException,
     assertEquals(partToRemove.getTable().getTableName(),
         result.getPartitionsNotOnFs().iterator().next().getTableName());
     assertEquals(Collections.<CheckResult.PartitionResult>emptySet(), result.getPartitionsNotInMs());
-
-    List<Map<String, String>> partsCopy = new ArrayList<Map<String, String>>();
-    partsCopy.add(partitions.get(1).getSpec());

Review comment:
       is there a successor of this test?

##########
File path: parser/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g
##########
@@ -734,6 +734,21 @@ dropPartitionOperator
     EQUAL | NOTEQUAL | LESSTHANOREQUALTO | LESSTHAN | GREATERTHANOREQUALTO | GREATERTHAN
     ;
 
+filterPartitionSpec
+    :
+    LPAREN filterPartitionVal (COMMA  filterPartitionVal )* RPAREN -> ^(TOK_PARTSPEC filterPartitionVal +)
+    ;
+
+filterPartitionVal
+    :
+    identifier filterPartitionOperator constant -> ^(TOK_PARTVAL identifier filterPartitionOperator constant)
+    ;
+
+filterPartitionOperator
+    :
+    EQUAL | NOTEQUAL | LESSTHANOREQUALTO | LESSTHAN | GREATERTHANOREQUALTO | GREATERTHAN | KW_LIKE

Review comment:
       `dropPartitionSpec` seems to use almost the same construct; I don't see any reason to duplicate it.
   the only difference I see right now is `LIKE` - are there any other differences?
   
   I think we should reuse the existing rules instead of duplicating them.

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java
##########
@@ -383,7 +375,29 @@ void findUnknownPartitions(Table table, Set<Path> partPaths,
     // now check the table folder and see if we find anything
     // that isn't in the metastore
     Set<Path> allPartDirs = new HashSet<Path>();
+    Set<Path> partDirs = new HashSet<Path>();
+    List<FieldSchema> partColumns = table.getPartitionKeys();
     checkPartitionDirs(tablePath, allPartDirs, Collections.unmodifiableList(getPartColNames(table)));
+
+    if (filterExp != null) {
+      PartitionExpressionProxy expressionProxy = createExpressionProxy(conf);
+      List<String> paritions = new ArrayList<>();
+      for (Path path : allPartDirs) {
+        // remove the table's path from the partition path
+        // eg: <tablePath>/p1=1/p2=2/p3=3 ---> p1=1/p2=2/p3=3
+        paritions.add(path.toString().substring(tablePath.toString().length() + 1));

Review comment:
       I'm wondering whether `tablePath` could end with a '/'; if it does, and `checkPartitionDirs` removes double slashes, this could eat up one extra char.
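
A minimal sketch of the off-by-one concern, using plain strings so it runs on its own. Whether a trailing slash can actually reach this code depends on how `tablePath` is constructed, which is not shown in the diff; the class and method names are hypothetical.

```java
final class RelativePartitionPath {
  // Mirrors the diff: assumes tablePath has no trailing slash.
  static String naive(String tablePath, String partPath) {
    return partPath.substring(tablePath.length() + 1);
  }

  // Defensive variant: drop trailing slashes, then strip "<base>/" as a checked prefix.
  static String safe(String tablePath, String partPath) {
    String base = tablePath;
    while (base.endsWith("/")) {
      base = base.substring(0, base.length() - 1);
    }
    String prefix = base + "/";
    if (!partPath.startsWith(prefix)) {
      throw new IllegalArgumentException(partPath + " is not under " + tablePath);
    }
    return partPath.substring(prefix.length());
  }

  public static void main(String[] args) {
    String part = "/warehouse/t/p1=1/p2=2";
    System.out.println(naive("/warehouse/t", part));  // p1=1/p2=2
    System.out.println(naive("/warehouse/t/", part)); // 1=1/p2=2 (first char lost)
    System.out.println(safe("/warehouse/t/", part));  // p1=1/p2=2
  }
}
```

Checking the prefix explicitly also turns an unexpected path layout into an error instead of a silently truncated partition name.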

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreServerUtils.java
##########
@@ -1348,6 +1348,17 @@ public static Path getPath(Table table) {
     }
   }
 
+  public static void getPartitionListByFilterExp(IMetaStoreClient msc, Table table, byte[] filterExp,
+                                                 String defaultPartName, List<Partition> results)
+      throws MetastoreException {
+    try {
+      msc.listPartitionsByExpr(table.getCatName(), table.getDbName(), table.getTableName(), filterExp,

Review comment:
       this method accepts `byte[]`, and if I'm not mistaken it has been that way since around 2013



