Posted to dev@spark.apache.org by Vinayak Joshi5 <vi...@in.ibm.com> on 2017/01/31 10:25:36 UTC

Spark SQL Dataframe resulting from an except( ) is unusable

With Spark 2.x, I construct a Dataframe from a sample libsvm file:

scala> val higgsDF = spark.read.format("libsvm").load("higgs.libsvm")
higgsDF: org.apache.spark.sql.DataFrame = [label: double, features: 
vector]


Then build a new dataframe that involves an except():

scala> val train_df = higgsDF.sample(false, 0.7, 42)
train_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: 
double, features: vector]

scala> val test_df = higgsDF.except(train_df)
test_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: 
double, features: vector]

Now, most operations on the test_df fail with this exception:

scala> test_df.show()
java.lang.RuntimeException: no default for type 
org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
  at 
org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
  at 
org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:117)
  at 
org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:110)
   .
   .

Debugging this, I see that the schema of this dataframe is:

scala> test_df.schema
res4: org.apache.spark.sql.types.StructType = 
StructType(StructField(label,DoubleType,true), 
StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))

Looking a little deeper, the error occurs because the QueryPlanner ends up 
inside

  object ExtractEquiJoinKeys 
(/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala)

where it processes a LeftAnti join. There is then an attempt to generate a 
default Literal value for the org.apache.spark.ml.linalg.VectorUDT 
DataType, which fails with the above exception. This is because there is no 
match for VectorUDT in

def default(dataType: DataType): Literal = {..} 
(/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala)
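For illustration, here is a simplified, self-contained sketch (not the actual 
Spark source; the types below are hypothetical stand-ins) of how a match over 
DataType with no UserDefinedType case falls through to exactly this exception:

```scala
// Simplified stand-ins for Catalyst's DataType hierarchy; these are NOT
// the real Spark classes, just a sketch of the pattern-match gap.
sealed trait DataType
case object DoubleType extends DataType
case object BooleanType extends DataType
// Stand-in for a UDT such as org.apache.spark.ml.linalg.VectorUDT.
final case class UserDefinedType(name: String) extends DataType

def default(dataType: DataType): Any = dataType match {
  case DoubleType  => 0.0
  case BooleanType => false
  // No case handles UserDefinedType, so any UDT falls through here,
  // mirroring the "no default for type ..." RuntimeException above.
  case other => throw new RuntimeException(s"no default for type $other")
}
```

Presumably, a fix would add a case before the catch-all that delegates to the 
UDT's underlying sqlType, though that is speculation about the eventual patch.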


Any processing on this dataframe that causes Spark to build a query plan 
(i.e. almost all productive uses of this dataframe) fails due to this 
exception. 

Is it a miss in the Literal implementation that it does not handle 
UserDefinedTypes, or is it left out intentionally? Is there a way to work 
around this problem? It seems to be present in all 2.x versions.

Regards,
Vinayak Joshi



Re: Spark SQL Dataframe resulting from an except( ) is unusable

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
Hi Vinayak,

Thanks for reporting this.

I don't think UserDefinedType was left out intentionally. If you
already know how the UDT is represented in its internal format, you can
explicitly convert the UDT column to another SQL type to get
around this problem. It is a bit hacky, though.
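To make that idea concrete, here is a small, self-contained Scala sketch 
(deliberately avoiding a Spark dependency; VectorLike is a hypothetical 
stand-in for the ml Vector): convert the UDT payload to a plain, 
value-comparable representation before taking the set difference.

```scala
// Hypothetical stand-in for org.apache.spark.ml.linalg.Vector.
final case class VectorLike(values: Array[Double])

val all   = Seq(VectorLike(Array(1.0, 2.0)), VectorLike(Array(3.0, 4.0)))
val train = Seq(VectorLike(Array(1.0, 2.0)))

// Map the payload to Seq[Double], which has value equality (arrays compare
// by reference), mirroring a cast of the UDT column to array<double>
// before the except().
val testRows = all.map(_.values.toSeq).diff(train.map(_.values.toSeq))
```

At the DataFrame level the same idea would be a UDF that maps the features 
column to array<double> on both sides before the except(), converting back 
afterwards if a Vector column is still needed.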

I submitted a PR to fix this, but I am not sure if it will get into master
soon.


vijoshi wrote
> [quoted original message trimmed]





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org