
Posted to issues@spark.apache.org by "Josh Rosen (Jira)" <ji...@apache.org> on 2019/08/23 05:17:00 UTC

[jira] [Resolved] (SPARK-28702) Display useful error message (instead of NPE) for invalid Dataset operations (e.g. calling actions inside of transformations)

     [ https://issues.apache.org/jira/browse/SPARK-28702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-28702.
--------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25503
[https://github.com/apache/spark/pull/25503]

> Display useful error message (instead of NPE) for invalid Dataset operations (e.g. calling actions inside of transformations)
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28702
>                 URL: https://issues.apache.org/jira/browse/SPARK-28702
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Josh Rosen
>            Assignee: Shivu Sondur
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In Spark, SparkContext and SparkSession can only be used on the driver, not on executors. For example, this means that you cannot call {{someDataset.collect()}} inside of a Dataset or RDD transformation.
> When Spark serializes RDDs and Datasets, references to SparkContext and SparkSession are nulled out (by being marked as {{@transient}} or via the Closure Cleaner). As a result, RDD and Dataset methods that reference these driver-side-only objects (e.g. actions or transformations) will see {{null}} references and may fail with a {{NullPointerException}}. For example, in code which (via a chain of calls) tried to {{collect()}} a dataset inside of a Dataset.map operation:
> {code:java}Caused by: java.lang.NullPointerException
> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution$lzycompute(Dataset.scala:3027)
> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution(Dataset.scala:3025)
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3038)
> at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3036)
> [...] {code}
> The resulting NPE can be _very_ confusing to users.
> In SPARK-5063 I added some logic to throw clearer error messages when performing similar invalid actions on RDDs. This ticket's scope is to implement similar logic for Datasets.
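
The failure mode described in the issue can be reproduced without Spark. The sketch below is a simplified model, not Spark's implementation and not the code from the pull request: plain Java serialization stands in for Spark shipping a closure to an executor, and {{FakeSession}}, {{FakeDataset}}, {{roundTrip}} and the error text are all hypothetical names chosen for illustration. It shows a {{transient}} field coming back as {{null}} after deserialization, and how an explicit null check can replace the bare NPE with a descriptive error.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientGuardDemo {

    // Hypothetical stand-in for a driver-only object such as SparkSession.
    static class FakeSession {
        int count() {
            return 42;
        }
    }

    // Hypothetical stand-in for a Dataset that holds a transient driver-side reference.
    static class FakeDataset implements Serializable {
        private static final long serialVersionUID = 1L;

        // transient: skipped during serialization, so it is null in any deserialized copy.
        private final transient FakeSession session;

        FakeDataset(FakeSession session) {
            this.session = session;
        }

        // Unguarded access: dereferences the field directly and fails with a bare NPE
        // when called on a deserialized copy.
        int unguardedCount() {
            return session.count();
        }

        // Guarded access: checks for the nulled-out reference and reports the likely cause.
        int guardedCount() {
            if (session == null) {
                throw new IllegalStateException(
                    "This operation needs a driver-side session, which is not available here. "
                        + "It was probably invoked inside a transformation; actions can only "
                        + "be called on the driver.");
            }
            return session.count();
        }
    }

    // Serializes and deserializes the object, mimicking what happens when it is
    // captured by a closure and sent to another JVM.
    static FakeDataset roundTrip(FakeDataset ds) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(ds);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (FakeDataset) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        FakeDataset onDriver = new FakeDataset(new FakeSession());
        System.out.println("driver side: " + onDriver.guardedCount());

        FakeDataset onExecutor = roundTrip(onDriver);
        try {
            onExecutor.unguardedCount();
        } catch (NullPointerException e) {
            System.out.println("unguarded: bare NullPointerException");
        }
        try {
            onExecutor.guardedCount();
        } catch (IllegalStateException e) {
            System.out.println("guarded: " + e.getMessage());
        }
    }
}
```

The guard costs one null check per call and moves the diagnosis from a stack trace inside lazy-val internals to a message that names the probable mistake.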


