You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Anurag Agarwal (JIRA)" <ji...@apache.org> on 2017/11/21 09:44:00 UTC
[jira] [Commented] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE

    [ https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260481#comment-16260481 ] 

Anurag Agarwal commented on SPARK-18859:
----------------------------------------

Another workaround for prostgres databases can be to make dblink and query via dblink, because at the end you need to mention the target schema. This avoids confusion for the postgres driver and helps it identify nullability properly from a query.
I tried the approach by [~mentegy] and it works too, but somehow I was not having create view permission on source database and it made me look around for another approach.

> Catalyst codegen does not mark column as nullable when it should. Causes NPE
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-18859
>                 URL: https://issues.apache.org/jira/browse/SPARK-18859
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 2.0.2
>            Reporter: Mykhailo Osypov
>
> When joining two tables via LEFT JOIN, columns in right table may be NULLs, however catalyst codegen cannot recognize it.
> Example:
> {code:title=schema.sql|borderStyle=solid}
> create table masterdata.testtable(
>   id int not null,
>   age int
> );
> create table masterdata.jointable(
>   id int not null,
>   name text not null
> );
> {code}
> {code:title=query_to_select.sql|borderStyle=solid}
> (select t.id, t.age, j.name from masterdata.testtable t left join masterdata.jointable j on t.id = j.id) as testtable;
> {code}
> {code:title=master code|borderStyle=solid}
> val df = sqlContext
>       .read
>       .format("jdbc")
>       .option("dbTable", "query to select")
>       ....
>       .load
> //df generated schema
> /*
> root
>  |-- id: integer (nullable = false)
>  |-- age: integer (nullable = true)
>  |-- name: string (nullable = false)
> */
> {code}
> {code:title=Codegen|borderStyle=solid}
> /* 038 */       scan_rowWriter.write(0, scan_value);
> /* 039 */
> /* 040 */       if (scan_isNull1) {
> /* 041 */         scan_rowWriter.setNullAt(1);
> /* 042 */       } else {
> /* 043 */         scan_rowWriter.write(1, scan_value1);
> /* 044 */       }
> /* 045 */
> /* 046 */       scan_rowWriter.write(2, scan_value2);
> {code}
> Since *j.name* is from right table of *left join* query, it may be null. However generated schema doesn't think so (probably because it defined as *name text not null*)
> {code:title=StackTrace|borderStyle=solid}
> java.lang.NullPointerException
> 	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> 	at org.apache.spark.sql.execution.debug.package$DebugExec$$anonfun$3$$anon$1.hasNext(package.scala:146)
> 	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763)
> 	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
> 	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:86)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org