Posted to issues@spark.apache.org by "Charlie Feng (JIRA)" <ji...@apache.org> on 2018/11/21 09:30:00 UTC

[jira] [Issue Comment Deleted] (SPARK-26136) Row.getAs return null value in some condition

     [ https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charlie Feng updated SPARK-26136:
---------------------------------
    Comment: was deleted

(was: And I'm thinking that when row.getAs() can't infer the type correctly, it should throw a runtime exception instead of returning a null value.
When a user hits an exception, they will find the issue, debug the code, and fix it.
But when Spark returns a null value, the user may not notice the error, which can lead to production issues.)

> Row.getAs return null value in some condition
> ---------------------------------------------
>
>                 Key: SPARK-26136
>                 URL: https://issues.apache.org/jira/browse/SPARK-26136
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0, 2.3.2, 2.4.0
>         Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>            Reporter: Charlie Feng
>            Priority: Major
>
> {{Row.getAs("fieldName")}} returns a null value when all of the following conditions are met:
>  * It is used in {{DataFrame.flatMap()}}
>  * Another {{map()}} call is nested inside the {{flatMap}}
>  * {{row.getAs("fieldName")}} is called directly inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>     import spark.implicits._
>     val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>     df.show()
>     val df2 = df.flatMap { row =>
>       row.getAs[String]("XYZ").split(",").map { xyz =>
>         val colA: String = row.getAs("A")
>         val col0: String = row.getString(0)
>         (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
>       }
>     }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", "ColumnB", "ColumnXYZ")
>     df2.show()
>     spark.close()
>   }
> }
> {code}
> Console Output:
> {code}
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> {code}
> We try to get the "A" column with 4 approaches:
> 1. call {{row.getAs("A")}} directly inside a tuple
> 2. call {{row.getAs("A")}}, save the result into a variable {{colA}}, and put the variable into the tuple
> 3. call {{row.getString(0)}} directly inside a tuple
> 4. call {{row.getString(0)}}, save the result into a variable {{col0}}, and put the variable into the tuple
> We found that approaches 2~4 return the value "a1" successfully, but approach 1 returns null.
> This issue exists in Spark 2.4.0, 2.3.2, and 2.3.0.
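>
> Since the three approaches that work (2~4) all give the compiler a concrete {{String}} type for the value, a plausible workaround (a sketch, not verified in this report) is to pass the type parameter to {{getAs}} explicitly instead of letting the compiler infer it inside the tuple expression, mirroring the {{row.getAs[String]("XYZ")}} call that already works in the reproducer:
> {code}
> // Hypothetical workaround (column names ColumnA/ColumnB/ColumnXYZ are
> // illustrative): give getAs an explicit type parameter so the type is
> // not inferred from the surrounding tuple.
> val df2 = df.flatMap { row =>
>   row.getAs[String]("XYZ").split(",").map { xyz =>
>     (row.getAs[String]("A"), row.getString(1), xyz)
>   }
> }.toDF("ColumnA", "ColumnB", "ColumnXYZ")
> {code}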



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org