Posted to user@spark.apache.org by Don Drake <do...@gmail.com> on 2017/02/02 20:46:04 UTC

Spark 2 - Creating datasets from dataframes with extra columns

In Spark 1.6, when you created a Dataset from a DataFrame that had extra columns,
the columns not in the case class were dropped from the Dataset.

For example, in 1.6 the column c4 is gone:

scala> case class F(f1: String, f2: String, f3: String)
defined class F

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]

scala> ds.show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+

This behavior seems to have changed in Spark 2.0, and it persists in 2.1:

Spark 2.1.0:

scala> case class F(f1: String, f2: String, f3: String)
defined class F

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]

scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

Is there a way to get a Dataset that conforms to the case class in Spark
2.1.0? I'm attempting to use the case class to define an output schema, and
these extra columns are getting in the way.
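
One possible workaround, sketched here under the assumption that the goal is
simply to drop the extra columns: project the DataFrame onto the case class's
fields before converting. Deriving the column list from F's encoder keeps the
projection in sync with the case class if F changes later.

import org.apache.spark.sql.Encoders

// Sketch of a workaround: select only the columns declared on the case
// class, so the extra c4 column is dropped before converting to a Dataset.
val fColumns = Encoders.product[F].schema.fieldNames.map(df.col)
val dsTrimmed = df.select(fColumns: _*).as[F]
dsTrimmed.show()  // only f1, f2, f3 remain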

Thanks.

-Don

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143

Re: Spark 2 - Creating datasets from dataframes with extra columns

Posted by Don Drake <do...@gmail.com>.
This seems like a bug to me; the schemas should match.

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders

scala> val fEncoder = Encoders.product[F]
fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]

scala> fEncoder.schema == ds.schema
res2: Boolean = false

scala> ds.schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))

scala> fEncoder.schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
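
As a sanity check (a sketch, not output captured in this thread): once the
extra column is projected away, the field lists agree. Note that full
StructType equality can still fail if nullability flags differ between the
encoder and the source data.

// Sketch: drop c4 first, then compare against the encoder's schema.
val trimmed = ds.select("f1", "f2", "f3").as[F]
val namesMatch = fEncoder.schema.fieldNames.sameElements(trimmed.schema.fieldNames)
// namesMatch should be true; exact schema equality additionally requires
// matching nullability and metadata on every field.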

I'll open a JIRA.

-Don

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143