Posted to issues@kudu.apache.org by "Fengling Wang (JIRA)" <ji...@apache.org> on 2018/03/23 19:52:00 UTC

[jira] [Updated] (KUDU-2371) Allow Kudu-Spark upsert API to ignore NULL column values

     [ https://issues.apache.org/jira/browse/KUDU-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fengling Wang updated KUDU-2371:
--------------------------------
    Description: 
We've seen cases where users use Spark streaming to process JSON and upsert the results into Kudu. The JSON records may not specify every field; in that case, Spark sets the missing column values to NULL. When those records are upserted, existing row values in Kudu are overwritten with NULL. This is correct behavior for Kudu-Spark, but it is not what users want.

 
{noformat}
// This is the original Kudu table.
scala> df.printSchema
root
 |-- key: long (nullable = false)
 |-- int_val: integer (nullable = true)
 |-- string_val: string (nullable = true)

scala> df.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|    200|       foo|
+---+-------+----------+
 
// Read the JSON string into a DataFrame with a matching schema.
scala> val json_str_with_partial_columns = "{\"key\" : 123, \"int_val\" : 1}"
scala> val json_rdd = sc.parallelize(Seq(json_str_with_partial_columns))
scala> val df_from_json = sqlContext.read.schema(df.schema).json(json_rdd)
scala> df_from_json.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|      1|      null|
+---+-------+----------+


scala> kuduContext.upsertRows(df_from_json, "kudu_table")

// Below is the actual result.
scala> df.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|      1|      null|
+---+-------+----------+

// Below is the desired result.
scala> df.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|      1|       foo|
+---+-------+----------+{noformat}
  

To avoid this situation, it is suggested to add an extra flag/option to the upsertRows() API that treats NULL as unset/omitted. With that option enabled, unspecified column values would remain unchanged in the Kudu table.
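A minimal sketch of the NULL-skipping behavior the flag would need, written as plain Scala so it stands apart from the Spark/Kudu classes; the names `ignoreNull` and `columnsToSet` are illustrative only, not a committed API:

```scala
// Given the columns of one incoming row, decide which columns the upsert
// should actually set. With ignoreNull = true, NULL-valued columns are
// skipped, so the existing values in Kudu stay untouched.
def columnsToSet(row: Map[String, Any], ignoreNull: Boolean): Map[String, Any] =
  if (ignoreNull) row.filter { case (_, v) => v != null }
  else row

// For the example row {"key": 123, "int_val": 1, "string_val": null},
// ignoreNull = true would set only "key" and "int_val", leaving the
// stored "string_val" ("foo") unchanged.
val partialRow = Map[String, Any]("key" -> 123L, "int_val" -> 1, "string_val" -> null)
val toSet = columnsToSet(partialRow, ignoreNull = true)
```

Inside the real Kudu-Spark writer this filter would apply per row while building each upsert operation, with the default (false) preserving today's overwrite-with-NULL behavior for compatibility.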


> Allow Kudu-Spark upsert API to ignore NULL column values
> --------------------------------------------------------
>
>                 Key: KUDU-2371
>                 URL: https://issues.apache.org/jira/browse/KUDU-2371
>             Project: Kudu
>          Issue Type: Improvement
>          Components: spark
>    Affects Versions: 1.6.0
>            Reporter: Fengling Wang
>            Assignee: Fengling Wang
>            Priority: Major


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)