Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2021/05/21 14:25:00 UTC

[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

     [ https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1722:
--------------------------------------
    Status: Closed  (was: Patch Available)

> hive beeline/spark-sql  query specified field on mor table occur NPE
> --------------------------------------------------------------------
>
>                 Key: HUDI-1722
>                 URL: https://issues.apache.org/jira/browse/HUDI-1722
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Hive Integration, Spark Integration
>    Affects Versions: 0.7.0
>         Environment: spark2.4.5, hadoop3.1.1, hive 3.1.1
>            Reporter: tao meng
>            Assignee: tao meng
>            Priority: Major
>              Labels: pull-request-available, sev:critical, user-support-issues
>             Fix For: 0.9.0
>
>
> HUDI-892 introduced this problem.
>  That PR skips adding projection columns when there are no log files in the hoodieRealtimeSplit, but it does not consider that multiple getRecordReaders share the same jobConf.
>  Consider the following scenario:
>  we have four getRecordReaders:
>  reader1 (its hoodieRealtimeSplit contains no log files)
>  reader2 (its hoodieRealtimeSplit contains log files)
>  reader3 (its hoodieRealtimeSplit contains log files)
>  reader4 (its hoodieRealtimeSplit contains no log files)
> Now reader1 runs first: HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the jobConf is set to true, and no additional Hudi projection columns are added to the jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf).
> When reader2 runs later, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the jobConf is already true, so no additional Hudi projection columns are added for it either (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf).
>  As a result, _hoodie_record_key is missing from the projection and the merge step throws an exception:
> 2021-03-25 20:23:14,014 | INFO  | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_000000_0: Error: java.lang.NullPointerException
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
>  at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
>  at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
>  at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
>  at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
>  at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
>  at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
>  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
>  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191)
>  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
>  at org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37)
>  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
>  at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
>  
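> To make the ordering hazard concrete, the guard logic described above behaves roughly like this sketch (simplified signature, and a hypothetical property string standing in for the value of HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP; not the exact Hudi source):
>
> import org.apache.hadoop.mapred.JobConf
>
> // Simplified sketch of the flag handling in
> // HoodieParquetRealtimeInputFormat.addProjectionToJobConf as described above.
> def addProjectionToJobConf(splitHasLogFiles: Boolean, jobConf: JobConf): Unit = {
>   if (jobConf.get("hoodie.read.columns.set") == null) {
>     if (splitHasLogFiles) {
>       // add _hoodie_record_key and the other meta columns to the read projection
>     }
>     // Bug: the guard is set even when no columns were added, so a later
>     // split that DOES have log files skips the projection on the shared jobConf.
>     jobConf.set("hoodie.read.columns.set", "true")
>   }
> }
>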
> Obviously, this is an intermittent problem: if reader2 runs first, the additional Hudi projection columns are added to the jobConf and the query succeeds.
> Spark SQL can avoid this problem by setting spark.hadoop.cloneConf=true, which Spark does not recommend; Hive, however, has no way to avoid it.
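> For reference, that Spark-side workaround is applied when building the session; a minimal sketch (cloneConf makes each HadoopRDD task clone its JobConf instead of sharing one):
>
> import org.apache.spark.sql.SparkSession
>
> // Workaround sketch: spark.hadoop.cloneConf=true gives each record reader its
> // own JobConf copy, trading extra per-task cloning for correctness.
> val spark = SparkSession.builder()
>   .appName("hudi-mor-rt-query")
>   .config("spark.hadoop.cloneConf", "true")
>   .enableHiveSupport()
>   .getOrCreate()
>
> spark.sql("select count(col3) from hive_14b_rt").show()
>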
>  Test steps:
> step1:
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.spark.sql.functions.{expr, lit}
>
> val df = spark.range(0, 100000).toDF("keyid")
>   .withColumn("col3", expr("keyid"))
>   .withColumn("p", lit(0))
>   .withColumn("p1", lit(0))
>   .withColumn("p2", lit(7))
>   .withColumn("a1", lit(Array[String]("sb1", "rz")))
>   .withColumn("a2", lit(Array[String]("sb1", "rz")))
>
> // create hoodie table hive_14b (merge is the reporter's test helper, sketched below)
> merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> Notice: bulk_insert will produce four base files in the hoodie table.
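> The merge helper is not shown in this report; a minimal sketch of what it might look like, assuming standard Hudi 0.7 write options and a hypothetical base path (the reporter's real helper may differ):
>
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.spark.sql.{DataFrame, SaveMode}
>
> // Hypothetical reconstruction of the merge test helper used in these steps.
> def merge(df: DataFrame, parallelism: Int, database: String, table: String,
>           tableType: String, op: String): Unit = {
>   df.write.format("hudi")
>     .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType)
>     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, op)
>     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid")
>     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3")
>     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "p,p1,p2")
>     .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>     .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, database)
>     .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, table)
>     .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "p,p1,p2")
>     .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,
>       "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>     .option("hoodie.table.name", table)
>     .option("hoodie.bulkinsert.shuffle.parallelism", parallelism.toString)
>     .option("hoodie.upsert.shuffle.parallelism", parallelism.toString)
>     .mode(SaveMode.Append)
>     .save(s"/tmp/$table")  // hypothetical base path
> }
>
> With hive sync enabled, a MOR table named hive_14b is registered in Hive as hive_14b_ro and hive_14b_rt; step3 queries the _rt view.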
> step2:
> val df = spark.range(99999, 100002).toDF("keyid")
>   .withColumn("col3", expr("keyid"))
>   .withColumn("p", lit(0))
>   .withColumn("p1", lit(0))
>   .withColumn("p2", lit(7))
>   .withColumn("a1", lit(Array[String]("sb1", "rz")))
>   .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // upsert table 
> merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
>  Now we have four base files and one log file in the hoodie table (the upsert updates keyid 99999, which already lives in one of the base files, so MOR writes a log file for that file group).
> step3: 
> spark-sql/beeline: 
>  select count(col3) from hive_14b_rt;
> Then the query fails with the NPE shown above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)