You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/12 14:52:14 UTC

[GitHub] [iceberg] marton-bod opened a new pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

marton-bod opened a new pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329


   This patch:
   
   - Introduces a new snapshot summary metric for `total-files-size`. It was somehow missing up till now, even though it has its companion metrics `added-files-size` and `removed-files-size`. Introducing this total metric makes it consistent with the other 'metric groups'.
   - On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes the Hive read query planning significantly faster. In some cases, it led to 10x+ improvement on query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes first before execution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#issuecomment-800173681


   >I think it'd still be good to implement `InputEstimator` in the `HiveIcebergStorageHandler` to provide the right estimation on the scan using the filter during a join plan. It's a bit tricky [..]
   
   I have checked the Hive code and only `InputEstimatorTestClass` implements the `InputEstimator`.
   Also checked the usages of `Estimation` object. Only `getTotalLength` is used. Once for determining if we can use fetch task instead of spawning an MR/Tez job, and then for generating `ContentSummary.length`. This later one is used several places, so it can be useful but we definitely need specific use-cases to check how much effort should we put into calculating the value
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r593276558



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,13 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));
+    // we don't have the uncompressed file sizes, so we use the totalSize (size on disk) as an estimate for rawDataSize
+    parameters.put(StatsSetupConst.RAW_DATA_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       Indeed, `totalSize` is enough for the compilation speed-up!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#issuecomment-797583632


   @aokolnychyi: any concerns about adding `TOTAL_FILE_SIZE_PROP` to the `SnapshotSummary`?
   
   The Hive part looks good to me, so I will merge if there is no concerns about the core side.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#issuecomment-804416279


   Late +1 from me too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r595130883



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,11 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       I checked with a `PropertiesUpdate`, and it does not produce a new snapshot, so `metadata.currentSnapshot()` will be the same as before along with its summary object.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary merged pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
pvary merged pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r593244294



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,13 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));
+    // we don't have the uncompressed file sizes, so we use the totalSize (size on disk) as an estimate for rawDataSize
+    parameters.put(StatsSetupConst.RAW_DATA_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       Sure! thanks for pointing it out




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r594637395



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,11 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       If the summary property is missing, should we set the Hive property? I think this is correct only if `"0"` indicates to Hive that the value is not known.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#issuecomment-800981805


   Thanks for the PR @marton-bod!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r595216439



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,17 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    if (summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP) != null) {
+      parameters.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+    }

Review comment:
       I don't think so. For example, when we create a new table, Hive already puts numRows=0, totalSize=0, etc. into the HMS table params, so we would just end up removing those valid values here. Other than the we-just-created-the-table scenario, I think we should always have the summary object with these 3 values filled out. What do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r595206995



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,17 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    if (summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP) != null) {
+      parameters.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+    }

Review comment:
       Should we remove them if we do not have summary data?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r593241677



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,13 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));
+    // we don't have the uncompressed file sizes, so we use the totalSize (size on disk) as an estimate for rawDataSize
+    parameters.put(StatsSetupConst.RAW_DATA_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       As per my latest investigation, we might get away with filling only `TOTAL_SIZE`, and leaving `RAW_DATA_SIZE` blank. Based on this code I expect that we still would get rid of the file listing if the `TOTAL_SIZE` is not `0`:
   
   https://github.com/apache/hive/blob/a0034284fe02a5012f883704fcd57652519a4cd5/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStats.java#L202
   
   Could you please check this out?
   
   Thanks,
   Peter




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r595123298



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,11 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       When creating a table, `metadata.currentSnapshot()` is null, therefore we won't have a summary and will create an empty map for it. I think @rdblue is right that we should only update the HMS values if they're actually present in the summary. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r595084525



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,11 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       Good question.
   Based on [this](https://github.com/apache/hive/blob/a0034284fe02a5012f883704fcd57652519a4cd5/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStats.java#L238-L250) Hive considers `0` as a non valid statistics value anyway:
   ```
     public BasicStats(Partish p) {
       partish = p;
   
       rowCount = parseLong(StatsSetupConst.ROW_COUNT);
   [..]
       currentNumRows = rowCount;
   [..]
       if (currentNumRows > 0) {
         state = State.COMPLETE;
       } else {
         state = State.NONE;
       }
     }
   ```
   
   But I agree that unsetting the value would be more intuitive. What is the content of the `SnapshotSummary` in case of a new table? Setting `0` there would still be good (even if Hive does not consider it as a valid one)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r595125097



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,11 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       Should we remove them if they are not present for whatever reasons?
   
   What happens if a commit is a metadata only change? Do we still have the summary?
   
   Thanks,
   Peter




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] edgarRd commented on pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
edgarRd commented on pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#issuecomment-799740810


   Thanks for working on this. I think this is great to have the stats in the Hive table metadata. 
   
   However, I think it'd still be good to implement `InputEstimator` in the `HiveIcebergStorageHandler` to provide the right estimation on the scan using the filter during a join plan. It's a bit tricky in Iceberg since we'd need to compute the split tasks again or cache them. I guess initially computing them again (if we can reuse the code) would be okay to avoid caching.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] edgarRd edited a comment on pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
edgarRd edited a comment on pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#issuecomment-799740810


   Thanks for working on this. I think this is great to have the stats in the Hive table metadata. 
   
   However, I think it'd still be good to implement `InputEstimator` in the `HiveIcebergStorageHandler` to provide the right estimation on the scan using the filter during a join plan. It's a bit tricky in Iceberg since we'd need to compute the split tasks again or cache them. I guess initially computing them again (if we can reuse the code) would be okay to avoid caching  - although it could be slow is it's a very selective scan, in that case maybe we could do a quick stop if too large and use table stats.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r595222888



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,17 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    if (summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP) != null) {
+      parameters.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+    }

Review comment:
       Makes sense. Hope this is true:
   > Other than the we-just-created-the-table scenario, I think we should always have the summary object with these 3 values filled out




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#issuecomment-799514404


   This looks good to me


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r595104920



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,11 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       Indeed, with `"0"`, Hive treats it as unknown. It checks whether this value is `<= 0` and if so, then launches the stat collection task.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #2329: Core/Hive: Introduce total-files-size snapshot metric and populate HMS

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2329:
URL: https://github.com/apache/iceberg/pull/2329#discussion_r593276558



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -296,6 +302,13 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
       parameters.remove(hive_metastoreConstants.META_TABLE_STORAGE);
     }
 
+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));
+    // we don't have the uncompressed file sizes, so we use the totalSize (size on disk) as an estimate for rawDataSize
+    parameters.put(StatsSetupConst.RAW_DATA_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));

Review comment:
       Indeed, `totalSize` is enough for the compilation speed-up! Removed the population of the raw data size.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org