Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2022/04/04 02:12:00 UTC

[jira] [Closed] (HUDI-3772) We automatically enable InProcessLockProvider and lazy rollbacks in spark datasource write if compaction configs are not set for MOR

     [ https://issues.apache.org/jira/browse/HUDI-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan closed HUDI-3772.
-------------------------------------
    Resolution: Fixed

> We automatically enable InProcessLockProvider and lazy rollbacks in spark datasource write if compaction configs are not set for MOR
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-3772
>                 URL: https://issues.apache.org/jira/browse/HUDI-3772
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: configs, multi-writer
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>
> Some time ago we added a fix to Hudi that automatically detects whether any async table services are enabled and, if no lock provider is configured, automatically enables InProcessLockProvider, OCC, and lazy rollbacks. This is a prerequisite for enabling the metadata table, which is why we put the fix in.
>  
> This worked well for COW and for clustering. MOR was trickier, and we had to add an explicit check for the condition below to auto-enable it:
> if table type = MOR and compaction is async -> enable InProcessLockProvider.
> This is because COW has no compaction, whereas for MOR compaction has to be enabled; the only question is whether it is inline or async.
>  
> This all works well if the user explicitly sets the compaction config, as below:
> {code:java}
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option("hoodie.datasource.write.table.type","MERGE_ON_READ").
>   option("hoodie.compact.inline","true").
>   option("hoodie.compact.inline.max.delta.commits","2").
>   option(TABLE_NAME, tableName).
>   mode(Append).
>   save(basePath)
> {code}
>  
> In this case we clearly detect that compaction is inline and do not enable InProcessLockProvider.
> Auto-detection also works well with the Deltastreamer code path, since there we can clearly detect whether compaction is inline or async: for inline, Deltastreamer explicitly sets "hoodie.compact.inline" to "true".
>  
> The tricky part is the Spark datasource: if the user skips the compaction config altogether, we deduce that compaction is async and go ahead and enable InProcessLockProvider, along with OCC and lazy rollbacks. This is a behavior change for a simple single writer coming from 0.10.0.
>  
> {code:java}
> df2.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option("hoodie.datasource.write.table.type","MERGE_ON_READ").
>   option(TABLE_NAME, tableName).
>   mode(Append).
>   save(basePath)
> {code}
>  
> The reason is that, per the code, the default value for "hoodie.compact.inline" is "false", so we deduce that compaction is async whenever the user does not set it explicitly.
>  
> We have to find a way to fix this.
> In a production pipeline it is likely that every write has the compaction configs set; I don't see why someone would set them for some writes and not for others. But let's try to see if we can maintain the same behavior as before.
>  
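
The detection rule described in the issue can be modeled with a small sketch. It is an illustration only, not Hudi's actual code: the class and method names are hypothetical, only compaction is modeled (no other table services), and "hoodie.write.lock.provider" is assumed to be the lock provider key, since the issue text does not name it. The sketch shows why a MOR write that omits "hoodie.compact.inline" ends up looking like an async-compaction write.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the auto-detection rule from the issue; not Hudi's code. */
class LockAutoDetectSketch {
    static final String TABLE_TYPE = "hoodie.datasource.write.table.type";
    static final String INLINE_COMPACT = "hoodie.compact.inline";
    // Assumed key for the lock provider config; the issue text does not name it.
    static final String LOCK_PROVIDER = "hoodie.write.lock.provider";

    /** True when the rule would switch on InProcessLockProvider, OCC and lazy rollbacks. */
    static boolean autoEnablesInProcessLock(Map<String, String> opts) {
        boolean isMor = "MERGE_ON_READ".equals(opts.get(TABLE_TYPE));
        // A missing key falls back to the default "false", which reads as "async".
        boolean inline = Boolean.parseBoolean(opts.getOrDefault(INLINE_COMPACT, "false"));
        boolean lockConfigured = opts.containsKey(LOCK_PROVIDER);
        return isMor && !inline && !lockConfigured;
    }

    public static void main(String[] args) {
        // First write in the issue: inline compaction set explicitly.
        Map<String, String> explicitInline = new HashMap<>();
        explicitInline.put(TABLE_TYPE, "MERGE_ON_READ");
        explicitInline.put(INLINE_COMPACT, "true");

        // Second write in the issue: compaction config skipped altogether.
        Map<String, String> noCompactionConfig = new HashMap<>();
        noCompactionConfig.put(TABLE_TYPE, "MERGE_ON_READ");

        System.out.println(autoEnablesInProcessLock(explicitInline));     // false
        System.out.println(autoEnablesInProcessLock(noCompactionConfig)); // true
    }
}
```

The second case is the behavior change for single writers that the issue describes: nothing in the options asks for async compaction, yet the default value makes the rule fire.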



--
This message was sent by Atlassian Jira
(v8.20.1#820001)