Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2022/04/01 14:26:00 UTC

[jira] [Created] (HUDI-3772) We automatically enable InProcessLockProvider and lazy rollbacks in spark datasource write if compaction configs are not set for MOR

sivabalan narayanan created HUDI-3772:
-----------------------------------------

             Summary: We automatically enable InProcessLockProvider and lazy rollbacks in spark datasource write if compaction configs are not set for MOR
                 Key: HUDI-3772
                 URL: https://issues.apache.org/jira/browse/HUDI-3772
             Project: Apache Hudi
          Issue Type: Bug
          Components: configs, multi-writer
            Reporter: sivabalan narayanan


Some time back, we added a fix to Hudi whereby we automatically detect whether any async table services are enabled and, if no lock provider is configured, automatically enable InProcessLockProvider, OCC, and lazy rollbacks. This is a prerequisite for enabling the metadata table, which is why we put in this fix.

 

This worked out well for COW and for clustering. MOR was trickier: we had to add an explicit check for the condition below and auto-enable the lock provider when it holds.

if table type = MOR and compaction is async -> enable InProcessLockProvider.

The reason is that COW has no compaction, whereas for MOR compaction always has to be enabled; it is only a question of whether it is inline or async.
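
The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Hudi code: the object and method names are hypothetical, and only the three config keys are real Hudi configs. It can be pasted into spark-shell.

{code:scala}
// Hypothetical sketch of the auto-detection rule; not Hudi's implementation.
object LockAutoDetectSketch {
  val TableTypeKey     = "hoodie.datasource.write.table.type"
  val InlineCompactKey = "hoodie.compact.inline"
  val LockProviderKey  = "hoodie.write.lock.provider"

  // MOR always compacts; compaction is treated as async unless
  // "hoodie.compact.inline" resolves to "true" (its default is "false").
  def asyncCompactionAssumed(opts: Map[String, String]): Boolean = {
    val isMor  = opts.get(TableTypeKey).contains("MERGE_ON_READ")
    val inline = opts.getOrElse(InlineCompactKey, "false").toBoolean
    isMor && !inline
  }

  // Auto-enable InProcessLockProvider only when no lock provider is configured.
  def shouldAutoEnableInProcessLock(opts: Map[String, String]): Boolean =
    asyncCompactionAssumed(opts) && !opts.contains(LockProviderKey)
}
{code}

With this rule, a MOR write that sets "hoodie.compact.inline" to "true" is left alone, while a MOR write that resolves to async compaction and has no lock provider gets InProcessLockProvider.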

 

This all works out well if the user explicitly sets the compaction config, as below:
{code:java}
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option("hoodie.datasource.write.table.type","MERGE_ON_READ").
  option("hoodie.compact.inline","true").
  option("hoodie.compact.inline.max.delta.commits","2").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
{code}
 

So we clearly detect that compaction is inline and do not enable InProcessLockProvider.

Auto-detection also works well with the Deltastreamer code path, since we can clearly detect whether compaction is inline or async: for inline, Deltastreamer explicitly sets "hoodie.compact.inline" to "true".

 

The tricky part is the Spark datasource: if the user skips the compaction config altogether, we deduce that compaction is async and go ahead and enable InProcessLockProvider, along with OCC and lazy rollbacks. So this is a behavior change for a simple single writer coming from 0.10.0.

 
{code:java}
df2.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option("hoodie.datasource.write.table.type","MERGE_ON_READ").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
{code}
 

The reason is that, as per the code, the default value for "hoodie.compact.inline" is "false". So our default deduction is that compaction is async if the user does not explicitly set it.
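
To make the ambiguity concrete, here is a small sketch that can be pasted into spark-shell. The object and method names are hypothetical and this is not a proposed patch; it only shows that once the default is applied, a writer that never set the config looks the same as one that explicitly asked for async compaction, whereas an Option-based lookup would keep the two cases apart.

{code:scala}
// Hypothetical illustration; not Hudi's implementation.
object CompactionDefaultSketch {
  val InlineCompactKey = "hoodie.compact.inline"

  // Falls back to the default "false" when the key is absent.
  def inlineWithDefault(opts: Map[String, String]): Boolean =
    opts.getOrElse(InlineCompactKey, "false").toBoolean

  // Keeps "not set" (None) distinct from an explicit "false" (Some(false)).
  def inlineIfSet(opts: Map[String, String]): Option[Boolean] =
    opts.get(InlineCompactKey).map(_.toBoolean)
}

val unset         = Map("hoodie.datasource.write.table.type" -> "MERGE_ON_READ")
val explicitAsync = unset + ("hoodie.compact.inline" -> "false")

// Both look async once the default is applied.
CompactionDefaultSketch.inlineWithDefault(unset)          // false
CompactionDefaultSketch.inlineWithDefault(explicitAsync)  // false

// The Option-based lookup still tells them apart.
CompactionDefaultSketch.inlineIfSet(unset)                // None
CompactionDefaultSketch.inlineIfSet(explicitAsync)        // Some(false)
{code}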

 

We have to find a way to fix this. 


--
This message was sent by Atlassian Jira
(v8.20.1#820001)