Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2022/01/20 16:24:00 UTC

[jira] [Commented] (HUDI-512) Support for Index functions on columns to generate logical or micro partitioning

    [ https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479491#comment-17479491 ] 

Vinoth Chandar commented on HUDI-512:
-------------------------------------

We now have a really performant way of tracking column ranges and bloom filters, scheduled to land in 0.11. But queries usually involve a predicate of the form _*function(column) <= value*_. Partitioning is just one case of this. For example, consider a date-partitioned table. Here, _KeyGenerator#getPartitionPath_ is the actual index function applied to a column "ts", which adds a new partition column, say "datestr", to the table.
 * When users write "datestr >= ...", we need to be able to translate that predicate into "ts >= ..." to perform file pruning.
 * In general, users should be able to add such index functions from SQL or otherwise. We ship some built-in ones, and users can customize them.
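
The predicate translation in the first bullet can be sketched roughly as follows. This is a minimal Python illustration, not Hudi code: _datestr_of_ stands in for the index function that _KeyGenerator#getPartitionPath_ applies to "ts", and _translate_datestr_predicate_ is a hypothetical helper. It assumes "ts" holds epoch seconds in UTC and "datestr" is formatted as YYYY-MM-DD.

```python
from datetime import datetime, timezone
from typing import Tuple

SECONDS_PER_DAY = 86400


def datestr_of(ts: int) -> str:
    """Hypothetical index function: epoch seconds (UTC) -> 'YYYY-MM-DD'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")


def _day_start(datestr: str) -> int:
    """Epoch seconds at 00:00:00 UTC of the given day."""
    day = datetime.strptime(datestr, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(day.timestamp())


def translate_datestr_predicate(op: str, datestr: str) -> Tuple[str, int]:
    """Rewrite `datestr <op> value` into an equivalent predicate on `ts`.

    This works because datestr_of is monotonically non-decreasing in ts,
    so every range on datestr corresponds to a contiguous range on ts.
    """
    start = _day_start(datestr)
    next_start = start + SECONDS_PER_DAY
    if op == ">=":
        return (">=", start)       # any ts on or after the start of that day
    if op == ">":
        return (">=", next_start)  # strictly later day: start of the next day
    if op == "<":
        return ("<", start)        # strictly earlier day: before that day starts
    if op == "<=":
        return ("<", next_start)   # up to the end of that day
    raise ValueError(f"unsupported operator: {op}")
```

Under these assumptions, translate_datestr_predicate(">=", "2022-01-20") returns (">=", 1642636800), a bound on "ts" that can be compared against the tracked column ranges to skip files.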

For 0.11, it would be good if we can target just the simple partitioning use case.

> Support for Index functions on columns to generate logical or micro partitioning
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-512
>                 URL: https://issues.apache.org/jira/browse/HUDI-512
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Common Core
>    Affects Versions: 0.9.0
>            Reporter: Alexander Filipchik
>            Priority: Major
>              Labels: features
>             Fix For: 0.11.0
>
>
> This one is more inspirational, but, I believe, will be very useful. Currently Hudi follows the Hive table format, which means that data is logically and physically partitioned into a folder structure like:
> table_name
>   2019
>     01
>     02
>        bla.parquet
>  
> This has several issues:
> 1) Modern object stores (AWS S3, GCP) are more performant when each file name starts with some kind of random value. By definition, the Hive layout is not ideal for this.
> 2) Hive Metastore stores partitions in a text field in a single table (2 tables with very similar information) and doesn't support proper filtering. Data partitioned by day will be stored like:
> 2019/01/10
> 2019/01/11
> so only regexp queries are supported (at least in Hive 2.X.X).
> 3) Having a single point of failure which relies on a non-distributed DB is dangerous and creates bottlenecks.
>  
> The idea is to get rid of logical partitioning altogether (and the Hive metastore as well). If a dataset has a time column, the user should be able to query it without understanding the physical layout of the table (that is, without specifying partitions explicitly or accidentally ending up with a full table scan).
> It will require some kind of mapping of time to file locations (similar to Iceberg). I'm also leaning towards the idea that storing table metadata with the table is a good thing, as it can be read by the engine in one shot and will be faster than taxing a standalone metastore.
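
A minimal sketch of the time-to-file mapping described in the issue, assuming the table stores per-file minimum and maximum values of the time column in its own metadata. FileStats and prune_files are hypothetical names used only for illustration; they are not Hudi or Iceberg APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file metadata entry kept alongside the table."""
    path: str
    min_ts: int  # smallest value of the time column in the file
    max_ts: int  # largest value of the time column in the file


def prune_files(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Return paths of files that may contain rows with lo <= ts < hi.

    A file is kept only if its [min_ts, max_ts] range overlaps the
    half-open query range [lo, hi); all other files are skipped.
    """
    return [f.path for f in files if f.max_ts >= lo and f.min_ts < hi]
```

With metadata like this, a query on the time column never needs to know the folder layout: the engine reads the stats in one pass and opens only the files whose ranges overlap the query.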



--
This message was sent by Atlassian Jira
(v8.20.1#820001)