Posted to user@spark.apache.org by Chitral Verma <ch...@gmail.com> on 2017/11/19 23:08:48 UTC

[Spark SQL]: DataFrame schema resulting in NullPointerException

Hey,

I'm working on a use case that involves converting DStreams to
DataFrames after some transformations. I've simplified my code into the
following snippet to reproduce the error, and I've listed my environment
settings below.

*Environment:*

Spark Version: 2.2.0
Java: 1.8
Execution mode: local / IntelliJ


*Code:*

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

object Tests {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = ...
    import spark.implicits._

    val df = List(
      ("jim", "usa"),
      ("raj", "india"))
      .toDF("name", "country")

    // Rebuild each Row with the original schema attached
    df.rdd
      .map(x => x.toSeq)
      .map(x => new GenericRowWithSchema(x.toArray, df.schema))
      .foreach(println)
  }
}


This results in a NullPointerException because I'm referencing df.schema
directly inside map().

What I don't understand is why the following code (which simply stores the
schema in a value before the transformation) works just fine.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

object Tests {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = ...
    import spark.implicits._

    val df = List(
      ("jim", "usa"),
      ("raj", "india"))
      .toDF("name", "country")

    // Capture the schema in a local val on the driver before the map()
    val sc = df.schema

    df.rdd
      .map(x => x.toSeq)
      .map(x => new GenericRowWithSchema(x.toArray, sc))
      .foreach(println)
  }
}
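
In case a single copy-paste runnable repro helps, here is the same thing
consolidated into one snippet. The SparkSession setup (local[*] master, app
name) is just my local configuration and the object name is arbitrary; the
failing variant is kept commented out above the working one:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

object SchemaCaptureTest {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-capture-test")
      .getOrCreate()
    import spark.implicits._

    val df = List(
      ("jim", "usa"),
      ("raj", "india"))
      .toDF("name", "country")

    // Variant 1: reading df.schema inside the closure passed to map()
    // throws the NullPointerException described above.
    // df.rdd
    //   .map(x => x.toSeq)
    //   .map(x => new GenericRowWithSchema(x.toArray, df.schema))
    //   .foreach(println)

    // Variant 2: capturing the schema in a local val first works fine.
    val schema = df.schema
    df.rdd
      .map(x => x.toSeq)
      .map(x => new GenericRowWithSchema(x.toArray, schema))
      .foreach(println)

    spark.stop()
  }
}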


I wonder why this is happening, since *df.rdd* is not an action and there is
no visible change in the state of the DataFrame just yet. What are your
thoughts on this?

Regards,
Chitral Verma