Posted to commits@hudi.apache.org by "vinoyang (Jira)" <ji...@apache.org> on 2019/11/18 11:40:00 UTC

[jira] [Comment Edited] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

    [ https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976473#comment-16976473 ] 

vinoyang edited comment on HUDI-64 at 11/18/19 11:39 AM:
---------------------------------------------------------

[~thw] Since you have been a Flink PMC member for a long time, could you help me judge whether my analysis is correct: https://issues.apache.org/jira/browse/HUDI-64?focusedCommentId=16973099&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16973099

I am still trying to find a solution for the Spark#persist problem in the Flink domain.


was (Author: yanghua):
[~thomasWeise] Since you have been a Flink PMC member for a long time, could you help me judge whether my analysis is correct: https://issues.apache.org/jira/browse/HUDI-64?focusedCommentId=16973099&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16973099

I am still trying to find a solution for the Spark#persist problem in the Flink domain.

> Estimation of compression ratio & other dynamic storage knobs based on historical stats
> ---------------------------------------------------------------------------------------
>
>                 Key: HUDI-64
>                 URL: https://issues.apache.org/jira/browse/HUDI-64
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Storage Management, Write Client
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> Something core to Hudi writing is using heuristics or runtime workload statistics to optimize aspects of storage like file sizes, partitioning, and so on.
> Below is a list of all such places.
>  
>  # Compression ratio for parquet [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46] . This is used by HoodieWrapperFileSystem to estimate the number of bytes it has written for a given parquet file, so that it can close the file once the configured size has been reached. At the DFSOutputStream level we only know the bytes written before compression. Once enough data has been written, it should be possible to replace this with a simple estimate of the average record size (commit metadata gives the size and number of records in each file); see the sketch after this list.
>  # A very similar problem exists for log files [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52] . We write data into logs in avro and can log updates to the same record in parquet multiple times. We again need to estimate how large the log file(s) can grow while compaction can still produce a parquet file of the configured size; a budget sketch follows after this list.
>  # WorkloadProfile : [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java] caches the input records using Spark caching and computes the shape of the workload, i.e. how many records per partition, how many inserts vs updates, etc. This is used by the Partitioner here [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141] for assigning records to a file group. This is the critical one to replace for Flink support, and probably the hardest, since we would need to guess the shape of the input, which is not always possible.
>  # Within the partitioner, we already derive a simple average size per record [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756] from the last commit metadata alone. This can be generalized (default: [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71]).
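>
> As a rough illustration of items 1 and 4 (and of what deriving this from commit metadata could look like), here is a minimal sketch in Java. The CommitStat shape and all names here are hypothetical stand-ins, not the actual HoodieCommitMetadata API:
> {code:java}
> import java.util.List;
>
> /**
>  * Sketch: derive an average record size from recent commit metadata and
>  * use it to decide when a parquet file has reached the configured size.
>  * CommitStat is a hypothetical stand-in for the per-file stats that
>  * commit metadata already carries (bytes written, records written).
>  */
> public class RecordSizeEstimator {
>
>   static class CommitStat {
>     final long totalBytesWritten;
>     final long totalRecordsWritten;
>     CommitStat(long bytes, long records) {
>       this.totalBytesWritten = bytes;
>       this.totalRecordsWritten = records;
>     }
>   }
>
>   /** Weighted average record size over the last few commits. */
>   static long avgRecordSize(List<CommitStat> recentCommits, long fallbackSize) {
>     long bytes = 0, records = 0;
>     for (CommitStat stat : recentCommits) {
>       bytes += stat.totalBytesWritten;
>       records += stat.totalRecordsWritten;
>     }
>     // Fall back to the static config default until enough data has been
>     // written to trust the estimate.
>     return records > 0 ? bytes / records : fallbackSize;
>   }
>
>   /** Roll over to a new file once the estimated on-disk size hits the target. */
>   static boolean shouldRollFile(long recordsWritten, long avgRecordSize,
>                                 long maxFileSizeBytes) {
>     return recordsWritten * avgRecordSize >= maxFileSizeBytes;
>   }
> }
> {code}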
> Our goal in this Jira is to see if we can derive this information in the background purely from the commit metadata. Some parts of this are open-ended. A good starting point would be to see what's feasible and estimate the ROI before actually implementing.
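>
> For item 2, the same kind of estimate could bound log growth. A back-of-the-envelope sketch, with made-up parameter names (none of these are actual Hudi config keys):
> {code:java}
> /**
>  * Sketch for item 2: given an observed parquet compression ratio, bound
>  * how large the avro log files may grow so that compaction can still
>  * produce a parquet base file of the configured size.
>  */
> public class LogSizeBudget {
>
>   /**
>    * maxParquetFileSize : target base file size after compaction
>    * compressionRatio   : observed parquet bytes / uncompressed bytes,
>    *                      estimable from past commit metadata
>    * logToParquetRatio  : fraction of avro log bytes that survives
>    *                      merging duplicate updates during compaction
>    */
>   static long maxTotalLogSize(long maxParquetFileSize,
>                               double compressionRatio,
>                               double logToParquetRatio) {
>     // Uncompressed data volume that fits in one target parquet file ...
>     double uncompressedBudget = maxParquetFileSize / compressionRatio;
>     // ... divided by the survival ratio gives the raw log-byte budget.
>     return (long) (uncompressedBudget / logToParquetRatio);
>   }
> }
> {code}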
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)