Posted to commits@hudi.apache.org by "Sagar Sumit (Jira)" <ji...@apache.org> on 2021/11/05 16:34:00 UTC
[jira] [Comment Edited] (HUDI-2558) Clustering w/ sort columns with null values fails

    [ https://issues.apache.org/jira/browse/HUDI-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439242#comment-17439242 ] 

Sagar Sumit edited comment on HUDI-2558 at 11/5/21, 4:33 PM:
-------------------------------------------------------------

Hudi [simply returns null|https://github.com/apache/hudi/blob/3af6568d316f410184e3d4dcfdbf00a8802b1fb8/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java] for a column whose value is null. Subsequently, when the RDD is sorted, [Spark does a compare|https://github.com/apache/spark/blob/v2.4.7/core/src/main/java/org/apache/spark/util/collection/TimSort.java#L270-L277] that requires the keys being compared to be non-null.

We can try to make this behavior configurable, i.e., replace nulls with a user-configured default value. However, I think it's best to retain this behavior and document it. There is some discussion in the Guava community around this behavior; see https://github.com/google/guava/issues/5460

A simple workaround is to fill in default values after reading the dataframe but before writing the Hudi table:

{code:python}
df = df.fillna({'sort_column': 'default_value'})
{code}
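
To make the failure mode and the workaround concrete, here is a minimal plain-Python analogy. It is not Spark or Hudi code: Python's built-in sort stands in for the comparison-based sort, and the names `rows`, `fill_nulls`, `sort_column`, and `default_value` are hypothetical placeholders chosen for illustration.

```python
# Plain-Python analogy only; this is not Spark or Hudi code.
# All names here (rows, fill_nulls, sort_column, default_value) are hypothetical.

def fill_nulls(rows, column, default):
    """Return copies of the rows with None in `column` replaced by `default`."""
    return [
        {**row, column: default if row[column] is None else row[column]}
        for row in rows
    ]

rows = [
    {"id": 1, "sort_column": "b"},
    {"id": 2, "sort_column": None},
    {"id": 3, "sort_column": "a"},
]

# Sorting on a key that may be None fails, because None cannot be
# ordered against a string by the comparison the sort relies on.
try:
    sorted(rows, key=lambda r: r["sort_column"])
    sort_failed = False
except TypeError:
    sort_failed = True

# After substituting a default for the nulls, the same sort succeeds.
filled = fill_nulls(rows, "sort_column", "default_value")
sorted_ids = [r["id"] for r in sorted(filled, key=lambda r: r["sort_column"])]
```

Note that the chosen default decides where the formerly-null rows land in the sort order, so it is probably worth picking a value that does not collide with real data in that column.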

cc [~vinoth] [~satishkotha]



> Clustering w/ sort columns with null values fails
> -------------------------------------------------
>
>                 Key: HUDI-2558
>                 URL: https://issues.apache.org/jira/browse/HUDI-2558
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: sev:critical, user-support-issues
>             Fix For: 0.10.0
>
>
> https://github.com/apache/hudi/issues/3766



--
This message was sent by Atlassian Jira
(v8.3.4#803005)