You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "Eldhose Paul (Jira)" <ji...@apache.org> on 2021/05/11 22:41:00 UTC
[jira] [Created] (HUDI-1894) NULL values in timestamp column
defaulted
Eldhose Paul created HUDI-1894:
----------------------------------
Summary: NULL values in timestamp column defaulted
Key: HUDI-1894
URL: https://issues.apache.org/jira/browse/HUDI-1894
Project: Apache Hudi
Issue Type: Bug
Components: Spark Integration
Reporter: Eldhose Paul
Reading timestamp column from hudi and underlying parquet files in spark gives different results.
*hudi properties:*
{code:java}
hdfs dfs -cat /user/hive/warehouse/jira_expl.db/jiraissue_events/.hoodie/hoodie.properties
#Properties saved on Tue May 11 17:17:22 EDT 2021
#Tue May 11 17:17:22 EDT 2021
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.table.name=jiraissue
hoodie.archivelog.folder=archived
hoodie.table.type=MERGE_ON_READ
hoodie.table.version=1
hoodie.timeline.layout.version=1
{code}
*Reading directly from parquet using Spark:*
{code:java}
scala> val ji = spark.read.format("parquet").load("/user/hive/warehouse/jira_expl.db/jiraissue_events/*.parquet")
ji: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 49 more fields]scala> ji.filter($"id" === 1237858).withColumn("inputfile", input_file_name()).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", $"_hoodie_record_key", $"_hoodie_partition_path", $"_hoodie_file_name",$"resolutiondate", $"archiveddate", $"inputfile").show(false)
+-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name |resolutiondate|archiveddate|inputfile |
+-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
|20210511171722 |20210511171722_7_13718|1237858.0 | |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null |null |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet |
|20210511171722 |20210511171722_7_13718|1237858.0 | |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null |null |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_8-1610-78711_20210511173615.parquet%7C
+-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
{code}
*Reading `hudi` using Spark:*
{code:java}
scala> val jih = spark.read.format("org.apache.hudi").load("/user/hive/warehouse/jira_expl.db/jiraissue_events")
jih: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 49 more fields]scala> jih.filter($"id" === 1237858).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", $"_hoodie_record_key", $"_hoodie_partition_path", $"_hoodie_file_name",$"resolutiondate", $"archiveddate").show(false)
+-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name |resolutiondate |archiveddate |
+-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
|20210511171722 |20210511171722_7_13718|1237858.0 | |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|2018-07-30 14:58:52|1969-12-31 19:00:00|
+-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)