You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/08/06 17:50:23 UTC

[GitHub] [hive] vihangk1 commented on a change in pull request #1330: HIVE-23890: Create HMS endpoint for querying file lists using FlatBuf…

vihangk1 commented on a change in pull request #1330:
URL: https://github.com/apache/hive/pull/1330#discussion_r466575514



##########
File path: standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift
##########
@@ -1861,6 +1861,19 @@ struct ScheduledQueryProgressInfo{
   4: optional string errorMessage,
 }
 
+struct GetFileListRequest {

Review comment:
       This naming of this struct could be more generic (may be a better name could be GetFileMetadataRequest). Also did you consider reusing existing/extending getFileMetadata HMS API?

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
##########
@@ -5685,6 +5706,67 @@ private void alter_table_core(String catName, String dbname, String name, Table
       }
     }
 
+    @Override
+    public GetFileListResponse get_file_list(GetFileListRequest req) throws MetaException {
+      String catName = req.isSetCatName() ? req.getCatName() : getDefaultCatalog(conf);
+      String dbName = req.getDbName();
+      String tblName = req.getTableName();
+      List<String> partitions = req.getPartVals();
+      // Will be used later, when cache is introduced
+      String validWriteIdList = req.getValidWriteIdList();
+
+      startFunction("get_file_list", ": " + TableName.getQualified(catName, dbName, tblName)
+              + ", partitions: " + partitions.toString());
+
+
+      GetFileListResponse response = new GetFileListResponse();
+
+      boolean success = false;
+      Exception ex = null;
+      try {
+        Partition p =  getMS().getPartition(catName, dbName, tblName, partitions);
+        Path path = new Path(p.getSd().getLocation());
+
+        FileSystem fs = path.getFileSystem(conf);
+        RemoteIterator<LocatedFileStatus> itr = fs.listFiles(path, true);
+        while (itr.hasNext()) {
+          FileStatus fStatus = itr.next();
+          Reader reader = OrcFile.createReader(fStatus.getPath(), OrcFile.readerOptions(fs.getConf()));

Review comment:
       Does this assume that the request is always for a ORC table?

##########
File path: standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift
##########
@@ -1861,6 +1861,19 @@ struct ScheduledQueryProgressInfo{
   4: optional string errorMessage,
 }
 
+struct GetFileListRequest {
+  1: optional string catName,
+  2: optional string dbName,
+  3: optional string tableName,
+  4: optional list<string> partVals,
+  6: optional string validWriteIdList
+}
+
+struct GetFileListResponse {

Review comment:
       I think it will be useful to have a separate struct defined for the FileMetadata which also includes a type field. For instance, I can see this being useful for various engines and the FileMetadata format could be different for each engine. For instance, engine1 may require FileStatus, BlockInformation while engine2 only is interested in FileStatus, FilemodificationTime, while there is another file-metadata type for ACID state which includes some ACID specific information like fileformat that you used below.

##########
File path: standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift
##########
@@ -2454,10 +2467,14 @@ PartitionsResponse get_partitions_req(1:PartitionsRequest req)
   // partition keys in new_part should be the same as those in old partition.
   void rename_partition(1:string db_name, 2:string tbl_name, 3:list<string> part_vals, 4:Partition new_part)
                        throws (1:InvalidOperationException o1, 2:MetaException o2)
-  
+
   RenamePartitionResponse rename_partition_req(1:RenamePartitionRequest req)
                        throws (1:InvalidOperationException o1, 2:MetaException o2)
 
+  // Returns a file list using FlatBuffers as serialization
+  GetFileListResponse get_file_list(1:GetFileListRequest req)
+    throws(1:MetaException o1)

Review comment:
       Seems like if the partition doesn't exist you might need to throw a NoSuchObjectFoundException as well.

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
##########
@@ -5685,6 +5706,67 @@ private void alter_table_core(String catName, String dbname, String name, Table
       }
     }
 
+    @Override
+    public GetFileListResponse get_file_list(GetFileListRequest req) throws MetaException {
+      String catName = req.isSetCatName() ? req.getCatName() : getDefaultCatalog(conf);
+      String dbName = req.getDbName();
+      String tblName = req.getTableName();
+      List<String> partitions = req.getPartVals();
+      // Will be used later, when cache is introduced
+      String validWriteIdList = req.getValidWriteIdList();
+
+      startFunction("get_file_list", ": " + TableName.getQualified(catName, dbName, tblName)
+              + ", partitions: " + partitions.toString());
+
+
+      GetFileListResponse response = new GetFileListResponse();
+
+      boolean success = false;
+      Exception ex = null;
+      try {
+        Partition p =  getMS().getPartition(catName, dbName, tblName, partitions);
+        Path path = new Path(p.getSd().getLocation());
+
+        FileSystem fs = path.getFileSystem(conf);
+        RemoteIterator<LocatedFileStatus> itr = fs.listFiles(path, true);
+        while (itr.hasNext()) {
+          FileStatus fStatus = itr.next();
+          Reader reader = OrcFile.createReader(fStatus.getPath(), OrcFile.readerOptions(fs.getConf()));
+          boolean isRawFormat  = !CollectionUtils.isEqualCollection(reader.getSchema().getFieldNames(), ALL_ACID_ROW_NAMES);
+          int fileFormat = isRawFormat ? 0 : 2;

Review comment:
       I think fileFormat having 0 or 2 looks very arbitary. We should atleast define the valid values of this field and what each value means.

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
##########
@@ -5685,6 +5706,67 @@ private void alter_table_core(String catName, String dbname, String name, Table
       }
     }
 
+    @Override
+    public GetFileListResponse get_file_list(GetFileListRequest req) throws MetaException {
+      String catName = req.isSetCatName() ? req.getCatName() : getDefaultCatalog(conf);
+      String dbName = req.getDbName();
+      String tblName = req.getTableName();
+      List<String> partitions = req.getPartVals();
+      // Will be used later, when cache is introduced
+      String validWriteIdList = req.getValidWriteIdList();
+
+      startFunction("get_file_list", ": " + TableName.getQualified(catName, dbName, tblName)
+              + ", partitions: " + partitions.toString());
+
+
+      GetFileListResponse response = new GetFileListResponse();
+
+      boolean success = false;
+      Exception ex = null;
+      try {
+        Partition p =  getMS().getPartition(catName, dbName, tblName, partitions);
+        Path path = new Path(p.getSd().getLocation());
+
+        FileSystem fs = path.getFileSystem(conf);
+        RemoteIterator<LocatedFileStatus> itr = fs.listFiles(path, true);
+        while (itr.hasNext()) {
+          FileStatus fStatus = itr.next();
+          Reader reader = OrcFile.createReader(fStatus.getPath(), OrcFile.readerOptions(fs.getConf()));
+          boolean isRawFormat  = !CollectionUtils.isEqualCollection(reader.getSchema().getFieldNames(), ALL_ACID_ROW_NAMES);
+          int fileFormat = isRawFormat ? 0 : 2;
+          response.addToFileListData(createFileStatus(fStatus, p.getSd().getLocation(), fileFormat));
+        }
+        success = true;
+      } catch (Exception e) {
+        ex = e;
+        LOG.error("Failed to list files with error: " + e.getMessage());
+        throw new MetaException(e.getMessage());
+      } finally {
+        endFunction("get_file_list", success, ex);
+      }
+      return response;
+    }
+
+    private ByteBuffer createFileStatus(FileStatus fileStatus, String basePath, int fileFormat) {

Review comment:
       Please add a java doc here. Also, would be good to attribute the source where this code is from with a link to the file.

##########
File path: standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift
##########
@@ -1861,6 +1861,19 @@ struct ScheduledQueryProgressInfo{
   4: optional string errorMessage,
 }
 
+struct GetFileListRequest {
+  1: optional string catName,
+  2: optional string dbName,
+  3: optional string tableName,
+  4: optional list<string> partVals,

Review comment:
       Does this mean the file list is retrieved on 1 partition at a time? It seems like we can have bulk retrieval for a given list of partition names?

##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
##########
@@ -240,6 +247,20 @@
   private static ZooKeeperHiveHelper zooKeeperHelper = null;
   private static String msHost = null;
 
+  static final String OPERATION_FIELD_NAME = "operation";

Review comment:
       Some documentation of these constants will help improve the readability.

##########
File path: standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift
##########
@@ -1861,6 +1861,19 @@ struct ScheduledQueryProgressInfo{
   4: optional string errorMessage,
 }
 
+struct GetFileListRequest {
+  1: optional string catName,
+  2: optional string dbName,
+  3: optional string tableName,
+  4: optional list<string> partVals,
+  6: optional string validWriteIdList
+}
+
+struct GetFileListResponse {
+  1: optional list<binary> fileListData,
+  2: optional i32 fbVersionNumber

Review comment:
       It would be good to have a comment here saying what the version number represents. Also, fbVersionNumber sounds like its a flatbuffer specific version number? Is that correct? Why does this field need to be optional?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org