Posted to issues@kudu.apache.org by "Fengling Wang (JIRA)" <ji...@apache.org> on 2018/03/23 19:41:00 UTC

[jira] [Created] (KUDU-2371) Allow Kudu-Spark upsert API to ignore NULL column values

Fengling Wang created KUDU-2371:
-----------------------------------

             Summary: Allow Kudu-Spark upsert API to ignore NULL column values
                 Key: KUDU-2371
                 URL: https://issues.apache.org/jira/browse/KUDU-2371
             Project: Kudu
          Issue Type: Improvement
          Components: spark
    Affects Versions: 1.6.0
            Reporter: Fengling Wang
            Assignee: Fengling Wang


We've seen cases where users use Spark Streaming to process JSON and upsert the results into Kudu. The JSON records may not specify every field, in which case Spark sets the missing column values to NULL. When those records are upserted, existing column values in the Kudu table are overwritten with NULL. This is correct Kudu-Spark behavior, but not what users want to see.

 
{noformat}
// This is the original Kudu table.
scala> df.printSchema
root
 |-- key: long (nullable = false)
 |-- int_val: integer (nullable = true)
 |-- string_val: string (nullable = true)

scala> df.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|    200|       foo|
+---+-------+----------+
 
// Put JSON string into dataframe with matching schema.
scala> val json_str_with_partial_columns = "{\"key\" : 123, \"int_val\" : 1}"
scala> val json_rdd = sc.parallelize(Seq(json_str_with_partial_columns))
scala> val df_from_json = sqlContext.read.schema(df.schema).json(json_rdd)
scala> df_from_json.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|      1|      null|
+---+-------+----------+


scala> kuduContext.upsertRows(df_from_json, "kudu_table")

// Below is actual result.
scala> df.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|      1|      null|
+---+-------+----------+

// Below is the desired result.
scala> df.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|      1|       foo|
+---+-------+----------+
{noformat}

To avoid this situation, it is suggested to add an extra flag/option to the upsertRows() API that treats NULL column values as unset/omitted. With the flag enabled, unspecified column values would remain unchanged in the Kudu table.
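
One possible shape for this is sketched below, assuming a write-options object with an ignoreNull flag that upsertRows() passes down to the row conversion. The names KuduWriteOptions and fillPartialRow, and the extra upsertRows() parameter, are illustrative assumptions rather than the existing kudu-spark API:
{noformat}
import org.apache.kudu.client.PartialRow
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructType}

// Hypothetical options holder; the default preserves today's behavior.
case class KuduWriteOptions(ignoreNull: Boolean = false)

// Sketch of the per-row conversion: with ignoreNull set, NULL cells are skipped
// entirely, so the corresponding Kudu columns are left untouched by the upsert
// instead of being overwritten with NULL.
def fillPartialRow(row: Row, schema: StructType, kuduRow: PartialRow,
                   options: KuduWriteOptions): Unit = {
  schema.fields.zipWithIndex.foreach { case (field, idx) =>
    if (row.isNullAt(idx)) {
      // Only write an explicit NULL when the caller did not ask to ignore NULLs.
      if (!options.ignoreNull) kuduRow.setNull(field.name)
    } else {
      field.dataType match {
        case LongType    => kuduRow.addLong(field.name, row.getLong(idx))
        case IntegerType => kuduRow.addInt(field.name, row.getInt(idx))
        case StringType  => kuduRow.addString(field.name, row.getString(idx))
        case t           => throw new IllegalArgumentException(s"type $t not handled in this sketch")
      }
    }
  }
}

// Caller side, with the hypothetical flag enabled:
// kuduContext.upsertRows(df_from_json, "kudu_table", KuduWriteOptions(ignoreNull = true))
{noformat}
Since ignoreNull would default to false, existing callers would keep the current semantics and only opt in when they want NULLs treated as "leave the column alone".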


