Posted to commits@hudi.apache.org by "Forward Xu (Jira)" <ji...@apache.org> on 2022/03/15 01:25:00 UTC

[jira] [Commented] (HUDI-2175) Support dynamic schemas with hudi

    [ https://issues.apache.org/jira/browse/HUDI-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506657#comment-17506657 ] 

Forward Xu commented on HUDI-2175:
----------------------------------

Hi [~shivnarayan], I think this implementation is compatible with queries, but it is not good enough. This scenario is very common in machine learning and feature engineering, where each run of a machine learning algorithm computes only a few features (data columns).

I think we should avoid loading all the data and then filtering when only some columns are required. We should first support storing data by column. For example, we could add column families as in HBase: write separate data files per column family when writing, and read only the files for the requested columns when reading.
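
As a rough illustration of the column-family idea, here is a toy Python sketch. It is not Hudi code and does not reflect any existing Hudi storage layout: the family names, the COLUMN_FAMILIES mapping, and both helper functions are hypothetical, and in-memory dicts stand in for separate data files.

```python
# Toy sketch only: dicts stand in for per-family data files.
# COLUMN_FAMILIES, write_records and read_columns are hypothetical names.

# Assumed mapping of column family -> columns stored together.
COLUMN_FAMILIES = {
    "base": ["colA", "colB"],
    "features": ["colC", "colD"],
}


def write_records(records, key_field="rec_key"):
    """Split each record by column family, one 'file' per family."""
    files = {}  # family -> {rec_key: {column: value}}
    for rec in records:
        key = rec[key_field]
        for family, columns in COLUMN_FAMILIES.items():
            part = {c: rec[c] for c in columns if c in rec}
            if part:
                files.setdefault(family, {})[key] = part
    return files


def read_columns(files, wanted):
    """Open only the families that hold a wanted column, then project."""
    wanted = set(wanted)
    touched = [f for f, cols in COLUMN_FAMILIES.items() if wanted & set(cols)]
    rows = {}
    for family in touched:
        for key, part in files.get(family, {}).items():
            projected = {c: v for c, v in part.items() if c in wanted}
            if projected:
                rows.setdefault(key, {}).update(projected)
    return rows, touched


records = [
    {"rec_key": 1, "colA": "a", "colC": 3},
    {"rec_key": 2, "colD": 4},
]
files = write_records(records)
rows, touched = read_columns(files, ["colC"])
```

In this sketch, reading colC touches only the "features" family, so the "base" data is never loaded.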

> Support dynamic schemas with hudi
> ---------------------------------
>
>                 Key: HUDI-2175
>                 URL: https://issues.apache.org/jira/browse/HUDI-2175
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: pull-request-available, sev:high
>
> Sometimes users have several producers, and each producer writes only a subset of the columns.
>  
> For example:
> Producer 1: rec_key, colA, colB, colC
> Producer 2: rec_key, colC, colD, colE, colF
> Producer 3: rec_key, colB, colF, colI, colK
>  
> Expectation from Hudi:
> keep merging new columns into the table schema and inject default values for all other missing columns.
>  
> So, for the above use case, the final Hudi table's schema is expected to be
> rec_key, colA, colB, colC, colD, colE, colF, colI, colK
>  
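
The merge described in the quoted issue can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hudi's implementation: merge_schemas and fill_defaults are hypothetical helpers, schemas are plain lists of column names, and None is assumed as the default for missing columns.

```python
# Illustrative sketch only; merge_schemas and fill_defaults are hypothetical.

def merge_schemas(schemas):
    """Union the column names, keeping first-seen order."""
    merged = []
    for schema in schemas:
        for col in schema:
            if col not in merged:
                merged.append(col)
    return merged


def fill_defaults(record, schema, default=None):
    """Project a record onto the merged schema, filling missing columns."""
    return {col: record.get(col, default) for col in schema}


# Producer schemas taken from the issue description.
producers = [
    ["rec_key", "colA", "colB", "colC"],
    ["rec_key", "colC", "colD", "colE", "colF"],
    ["rec_key", "colB", "colF", "colI", "colK"],
]
table_schema = merge_schemas(producers)

# A record from Producer 1, widened to the merged schema.
row = fill_defaults({"rec_key": "k1", "colA": 1, "colB": 2, "colC": 3}, table_schema)
```

With the three producers from the issue, table_schema comes out as rec_key, colA, colB, colC, colD, colE, colF, colI, colK, and the Producer 1 record gets None for colD through colK.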



--
This message was sent by Atlassian Jira
(v8.20.1#820001)