Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2018/10/04 07:44:00 UTC

[jira] [Commented] (SPARK-25587) NPE in Dataset when reading from Parquet as Product

    [ https://issues.apache.org/jira/browse/SPARK-25587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637900#comment-16637900 ] 

Wenchen Fan commented on SPARK-25587:
-------------------------------------

I believe this is an issue with the Spark Shell. Judging from the stack trace, the NPE occurs while Spark is trying to obtain the outer object for a class defined in the REPL. The logic in `OuterScopes.getOuterScope` is tightly coupled to how the Spark Shell wraps user code.
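
For background, the Scala REPL wraps each interpreted line in synthetic objects, so case classes defined interactively are compiled as member classes whose constructors take a hidden outer-instance parameter. A minimal plain-Scala illustration of the lookup problem (`Wrapper` is an illustrative stand-in for the REPL's synthetic wrapper, not its real name):

{code:scala}
// Illustration only: Wrapper stands in for the REPL's synthetic line object.
class Wrapper {
  case class Inner(names: Seq[String] = Seq())
}

object OuterPointerDemo {
  def main(args: Array[String]): Unit = {
    val cls = classOf[Wrapper#Inner]
    // Inner is a member class: its one constructor takes the enclosing
    // Wrapper as its first parameter.
    cls.getConstructors.foreach(println)
    // Reflective instantiation needs an outer instance -- the analogue of
    // what OuterScopes.getOuterScope must supply for REPL-defined classes.
    val outer = new Wrapper
    val ctor = cls.getConstructors.head
    println(ctor.newInstance(outer, Seq("name0")))
  }
}
{code}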

cc'ing people who may have more context on recent changes to the REPL: [~srowen] [~dbtsai] [~sadhen]
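
One way to sidestep the REPL outer-scope machinery entirely is to define the case classes as top-level classes using the REPL's `:paste -raw` mode. This is a workaround sketch, not a confirmed fix for the underlying bug; the `repro` package name is purely illustrative:

{code:scala}
// In spark-shell, enter raw paste mode so the classes compile as ordinary
// top-level classes in a real package, with no outer instance to resolve:
//   scala> :paste -raw
package repro

case class Inner(names: Seq[String] = Seq())
case class Outer(inners: Seq[Inner] = Seq())
{code}

After `import repro._`, re-running the reporter's script should no longer need `OuterScopes` to locate an outer object, since top-level classes carry no outer pointer.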

> NPE in Dataset when reading from Parquet as Product
> ---------------------------------------------------
>
>                 Key: SPARK-25587
>                 URL: https://issues.apache.org/jira/browse/SPARK-25587
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Spark version 2.4.0 (RC2).
> {noformat}
> $ spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
>       /_/
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_181
> Branch
> Compiled by user  on 2018-09-27T14:50:10Z
> Revision
> Url
> Type --help for more information.
> {noformat}
>            Reporter: Michael Heuer
>            Priority: Major
>
> While attempting to replicate the following issue in ADAM, a library downstream of Spark:
> https://github.com/bigdatagenomics/adam/issues/2058
> also reported as
> https://issues.apache.org/jira/browse/SPARK-25588
> I found that the following Spark Shell script throws an NPE when attempting to read from Parquet.
> {code:scala}
> sc.setLogLevel("INFO")
> import spark.implicits._
> case class Inner(
>   names: Seq[String] = Seq()
> ) 
> case class Outer(
>   inners: Seq[Inner] = Seq()
> )
> val inner = Inner(Seq("name0", "name1"))
> val outer = Outer(Seq(inner))
> val dataset = sc.parallelize(Seq(outer)).toDS()
> val path = "outers.parquet"
> dataset.toDF().write.format("parquet").save(path)
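> // Reading back as Dataset[Outer] and materializing a row triggers the NPE: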
> val roundtrip = spark.read.parquet(path).as[Outer]
> roundtrip.first
> {code}
> Stack trace
> {noformat}
> $ spark-shell -i failure.scala
> ...
> 2018-10-01 16:57:48 INFO  ParquetWriteSupport:54 - Initialized Parquet WriteSupport with Catalyst schema:
> {
>   "type" : "struct",
>   "fields" : [ {
>     "name" : "inners",
>     "type" : {
>       "type" : "array",
>       "elementType" : {
>         "type" : "struct",
>         "fields" : [ {
>           "name" : "names",
>           "type" : {
>             "type" : "array",
>             "elementType" : "string",
>             "containsNull" : true
>           },
>           "nullable" : true,
>           "metadata" : { }
>         } ]
>       },
>       "containsNull" : true
>     },
>     "nullable" : true,
>     "metadata" : { }
>   } ]
> }
> and corresponding Parquet message type:
> message spark_schema {
>   optional group inners (LIST) {
>     repeated group list {
>       optional group element {
>         optional group names (LIST) {
>           repeated group list {
>             optional binary element (UTF8);
>           }
>         }
>       }
>     }
>   }
> }
> 2018-10-01 16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem columnStore to file. allocated memory: 0
> 2018-10-01 16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem columnStore to file. allocated memory: 26
> ...
> 2018-10-01 16:57:49 INFO  FileSourceStrategy:54 - Output Data Schema: struct<inners: array<struct<names:array<string>>>>
> 2018-10-01 16:57:49 INFO  FileSourceScanExec:54 - Pushed Filters:
> java.lang.NullPointerException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
>   at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.doGenCode(objects.scala:796)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
>   at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:296)
>   at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$class.prepareArguments(objects.scala:98)
>   at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.prepareArguments(objects.scala:431)
>   at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:483)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:155)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:152)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:296)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:152)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:38)
>   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1193)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2552)
>   at org.apache.spark.sql.Dataset.first(Dataset.scala:2559)
>   ... 59 elided
> {noformat}
> This is a regression from Spark version 2.3.1, which runs the above Spark Shell script correctly.


