Posted to commits@hudi.apache.org by "Raymond Xu (Jira)" <ji...@apache.org> on 2023/03/09 17:05:00 UTC

[jira] [Updated] (HUDI-5688) schema field of EmptyRelation subtype of BaseRelation should not be null

     [ https://issues.apache.org/jira/browse/HUDI-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-5688:
-----------------------------
    Fix Version/s: 0.13.1
                   0.12.3

> schema field of EmptyRelation subtype of BaseRelation should not be null
> ------------------------------------------------------------------------
>
>                 Key: HUDI-5688
>                 URL: https://issues.apache.org/jira/browse/HUDI-5688
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: core
>            Reporter: Pramod Biligiri
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.1, 0.12.3
>
>         Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, Main.java, pom.xml
>
>
> If a table has no completed instants and no user-defined schema (as represented by the userSpecifiedSchema field in DataSource.scala), then the EmptyRelation returned by DefaultSource.createRelation has its schema set to null. This breaks the contract of Spark's BaseRelation, whose schema is a StructType and is not expected to be null.
> Module versions: current apache-hudi master (commit hash abe26d4169c04da05b99941161621876e3569e96), built with Spark 3.2 and Scala 2.12.
> The following spark-shell session reproduces the issue:
> spark.read.format("hudi")
>   .option("hoodie.datasource.query.type", "incremental")
>   .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
> java.lang.NullPointerException
>   at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
>   at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
>   at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
>   ... 50 elided  
> Find attached a few screenshots which show the code flow and the buggy state of the variables. Also find attached a Java file and pom.xml that can be used to reproduce the same (sorry, don't have an anonymized table to share yet).
> The bug seems to have been introduced in this particular PR change: [https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
> Initial work on that file happened in this Jira (https://issues.apache.org/jira/browse/HUDI-4363) and PR (https://github.com/apache/hudi/pull/6046) respectively.
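
The contract violation described above can be illustrated with a minimal, self-contained sketch. The class names below (StructTypeStub, BaseRelationStub, EmptyRelationStub) are stand-ins for Spark's StructType/BaseRelation and are NOT Hudi's or Spark's actual code; the fallback-to-empty-schema idea is one plausible shape of a fix, not the committed one:

```java
import java.util.Collections;
import java.util.List;

// Stand-in for Spark's StructType, for illustration only.
final class StructTypeStub {
    final List<String> fieldNames;
    StructTypeStub(List<String> fieldNames) { this.fieldNames = fieldNames; }
    static StructTypeStub empty() { return new StructTypeStub(Collections.emptyList()); }
}

// Stand-in for Spark's BaseRelation: the schema() contract requires non-null.
abstract class BaseRelationStub {
    abstract StructTypeStub schema();
}

// Stand-in for Hudi's EmptyRelation: with no completed instants and no
// user-specified schema, it must still report an empty (non-null) schema.
final class EmptyRelationStub extends BaseRelationStub {
    private final StructTypeStub userSpecifiedSchema; // may legitimately be null

    EmptyRelationStub(StructTypeStub userSpecifiedSchema) {
        this.userSpecifiedSchema = userSpecifiedSchema;
    }

    @Override
    StructTypeStub schema() {
        // Fall back to an empty StructType instead of propagating null,
        // so downstream callers (e.g. LogicalRelation) never see a null schema.
        return userSpecifiedSchema != null ? userSpecifiedSchema : StructTypeStub.empty();
    }
}

public class Main {
    public static void main(String[] args) {
        BaseRelationStub rel = new EmptyRelationStub(null);
        System.out.println(rel.schema().fieldNames.size()); // prints 0, no NPE
    }
}
```

With this fallback, the NullPointerException in CharVarcharUtils.replaceCharVarcharWithStringInSchema shown in the stack trace above would not be reachable, because LogicalRelation would always receive a StructType.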



--
This message was sent by Atlassian Jira
(v8.20.10#820010)