Posted to user@spark.apache.org by Jaonary Rabarisoa <ja...@gmail.com> on 2015/03/31 13:10:50 UTC

Unable to save dataframe with UDT created with sqlContext.createDataFrame

Hi all,

A DataFrame with a user-defined type (here mllib.Vector) created with
sqlContext.createDataFrame can't be saved to a Parquet file; the save
raises a ClassCastException:
org.apache.spark.mllib.linalg.DenseVector cannot be cast to
org.apache.spark.sql.Row.

Here is an example of code to reproduce this error:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext

object TestDataFrame {

  def main(args: Array[String]): Unit = {
    //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
    val conf = new SparkConf().setAppName("RankingEval").setMaster("local[8]")
      .set("spark.executor.memory", "6g")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._

    val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
    val dataDF = data.toDF

    dataDF.save("test1.parquet") // works

    val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)

    dataDF2.save("test2.parquet") // throws the ClassCastException
  }
}
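
Saving dataDF (built directly from the typed RDD with toDF) works; only
dataDF2, rebuilt from the Row RDD via createDataFrame, fails. My guess is
that createDataFrame(RDD[Row], schema) skips the UDT serialization step,
so the Parquet writer finds a raw DenseVector where it expects the UDT's
struct (Row) representation. A minimal workaround sketch (same session,
reusing the names above; the output path is just a placeholder):

// Workaround sketch: rebuild the DataFrame from the typed RDD with toDF
// instead of round-tripping the Row RDD through createDataFrame.
val dataDF3 = data.toDF
dataDF3.printSchema()                    // the features column keeps its UDT-backed type
dataDF3.save("test2_workaround.parquet") // saves fine, like test1.parquet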


Is this related to https://issues.apache.org/jira/browse/SPARK-5532,
and how can it be solved?


Cheers,


Jao

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

Posted by Jaonary Rabarisoa <ja...@gmail.com>.
Good! Thank you.

On Thu, Apr 2, 2015 at 9:05 AM, Xiangrui Meng <me...@gmail.com> wrote:

> I reproduced the bug on master and submitted a patch for it:
> https://github.com/apache/spark/pull/5329. It may get into Spark
> 1.3.1. Thanks for reporting the bug! -Xiangrui
>
> On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa <ja...@gmail.com>
> wrote:
> > Hmm, I got the same error with the master. Here is another test example
> > that fails. Here, I explicitly create a Row RDD, which corresponds to
> > the use case I am in:
> >
> > object TestDataFrame {
> >
> >   def main(args: Array[String]): Unit = {
> >
> >     val conf = new
> > SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
> >     val sc = new SparkContext(conf)
> >     val sqlContext = new SQLContext(sc)
> >
> >     import sqlContext.implicits._
> >
> >     val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
> >     val dataDF = sc.parallelize(data).toDF
> >
> >     dataDF.printSchema()
> >     dataDF.save("test1.parquet") // OK
> >
> >     val dataRow = data.map {case LabeledPoint(l: Double, f:
> > mllib.linalg.Vector)=>
> >       Row(l,f)
> >     }
> >
> >     val dataRowRDD = sc.parallelize(dataRow)
> >     val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)
> >
> >     dataDF2.printSchema()
> >
> >     dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
> >   }
> > }
> >
> >
> > On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng <me...@gmail.com>
> > wrote:
> >>
> >> I cannot reproduce this error on master, but I'm not aware of any
> >> recent bug fixes that are related. Could you build and try the current
> >> master? -Xiangrui
> >>
> >> On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa <ja...@gmail.com>
> >> wrote:
> >> > Hi all,
> >> >
> >> > A DataFrame with a user-defined type (here mllib.Vector) created with
> >> > sqlContext.createDataFrame can't be saved to a Parquet file; the save
> >> > raises a ClassCastException: org.apache.spark.mllib.linalg.DenseVector
> >> > cannot be cast to org.apache.spark.sql.Row.
> >> >
> >> > Here is an example of code to reproduce this error:
> >> >
> >> > object TestDataFrame {
> >> >
> >> >   def main(args: Array[String]): Unit = {
> >> >     //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
> >> >     val conf = new
> >> > SparkConf().setAppName("RankingEval").setMaster("local[8]")
> >> >       .set("spark.executor.memory", "6g")
> >> >
> >> >     val sc = new SparkContext(conf)
> >> >     val sqlContext = new SQLContext(sc)
> >> >
> >> >     import sqlContext.implicits._
> >> >
> >> >     val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
> >> >     val dataDF = data.toDF
> >> >
> >> >     dataDF.save("test1.parquet")
> >> >
> >> >     val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
> >> >
> >> >     dataDF2.save("test2.parquet")
> >> >   }
> >> > }
> >> >
> >> >
> >> > Is this related to https://issues.apache.org/jira/browse/SPARK-5532,
> >> > and how can it be solved?
> >> >
> >> >
> >> > Cheers,
> >> >
> >> >
> >> > Jao
> >
> >
>

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

Posted by Xiangrui Meng <me...@gmail.com>.
I reproduced the bug on master and submitted a patch for it:
https://github.com/apache/spark/pull/5329. It may get into Spark
1.3.1. Thanks for reporting the bug! -Xiangrui

On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa <ja...@gmail.com> wrote:
> Hmm, I got the same error with the master. Here is another test example that
> fails. Here, I explicitly create a Row RDD, which corresponds to the use case
> I am in:
>
> object TestDataFrame {
>
>   def main(args: Array[String]): Unit = {
>
>     val conf = new
> SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
>     val sc = new SparkContext(conf)
>     val sqlContext = new SQLContext(sc)
>
>     import sqlContext.implicits._
>
>     val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
>     val dataDF = sc.parallelize(data).toDF
>
>     dataDF.printSchema()
>     dataDF.save("test1.parquet") // OK
>
>     val dataRow = data.map {case LabeledPoint(l: Double, f:
> mllib.linalg.Vector)=>
>       Row(l,f)
>     }
>
>     val dataRowRDD = sc.parallelize(dataRow)
>     val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)
>
>     dataDF2.printSchema()
>
>     dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
>   }
> }
>
>
> On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>
>> I cannot reproduce this error on master, but I'm not aware of any
>> recent bug fixes that are related. Could you build and try the current
>> master? -Xiangrui
>>
>> On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa <ja...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > A DataFrame with a user-defined type (here mllib.Vector) created with
>> > sqlContext.createDataFrame can't be saved to a Parquet file; the save
>> > raises a ClassCastException: org.apache.spark.mllib.linalg.DenseVector
>> > cannot be cast to org.apache.spark.sql.Row.
>> >
>> > Here is an example of code to reproduce this error:
>> >
>> > object TestDataFrame {
>> >
>> >   def main(args: Array[String]): Unit = {
>> >     //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
>> >     val conf = new
>> > SparkConf().setAppName("RankingEval").setMaster("local[8]")
>> >       .set("spark.executor.memory", "6g")
>> >
>> >     val sc = new SparkContext(conf)
>> >     val sqlContext = new SQLContext(sc)
>> >
>> >     import sqlContext.implicits._
>> >
>> >     val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
>> >     val dataDF = data.toDF
>> >
>> >     dataDF.save("test1.parquet")
>> >
>> >     val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
>> >
>> >     dataDF2.save("test2.parquet")
>> >   }
>> > }
>> >
>> >
>> > Is this related to https://issues.apache.org/jira/browse/SPARK-5532, and
>> > how can it be solved?
>> >
>> >
>> > Cheers,
>> >
>> >
>> > Jao
>
>


Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

Posted by Jaonary Rabarisoa <ja...@gmail.com>.
Hmm, I got the same error with the master. Here is another test example
that fails. Here, I explicitly create a Row RDD, which corresponds to the
use case I am in:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{Row, SQLContext}

object TestDataFrame {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._

    val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
    val dataDF = sc.parallelize(data).toDF

    dataDF.printSchema()
    dataDF.save("test1.parquet") // OK

    val dataRow = data.map { case LabeledPoint(l: Double, f: mllib.linalg.Vector) =>
      Row(l, f)
    }

    val dataRowRDD = sc.parallelize(dataRow)
    val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)

    dataDF2.printSchema()

    dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
  }
}
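
Until there is a fix, one possible workaround for this Row-based use case
is to avoid the UDT column entirely and store the features as a plain array
of doubles. A sketch (reusing sc, sqlContext, and data from the example
above; the schema, field names, and output path are just illustrations,
assuming a plain array is acceptable downstream):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

// Explicit schema with a plain array column in place of the VectorUDT.
val plainSchema = StructType(Array(
  StructField("label", DoubleType, nullable = false),
  StructField("features", ArrayType(DoubleType, containsNull = false), nullable = false)))

// Convert each LabeledPoint to a Row holding plain SQL types only.
val plainRows = data.map { case LabeledPoint(l, f) => Row(l, f.toArray.toSeq) }

val plainDF = sqlContext.createDataFrame(sc.parallelize(plainRows), plainSchema)
plainDF.saveAsParquetFile("test3_plain.parquet") // no UDT involved, so this should succeed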


On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng <me...@gmail.com> wrote:

> I cannot reproduce this error on master, but I'm not aware of any
> recent bug fixes that are related. Could you build and try the current
> master? -Xiangrui
>
> On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa <ja...@gmail.com>
> wrote:
> > Hi all,
> >
> > A DataFrame with a user-defined type (here mllib.Vector) created with
> > sqlContext.createDataFrame can't be saved to a Parquet file; the save
> > raises a ClassCastException: org.apache.spark.mllib.linalg.DenseVector
> > cannot be cast to org.apache.spark.sql.Row.
> >
> > Here is an example of code to reproduce this error:
> >
> > object TestDataFrame {
> >
> >   def main(args: Array[String]): Unit = {
> >     //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
> >     val conf = new
> > SparkConf().setAppName("RankingEval").setMaster("local[8]")
> >       .set("spark.executor.memory", "6g")
> >
> >     val sc = new SparkContext(conf)
> >     val sqlContext = new SQLContext(sc)
> >
> >     import sqlContext.implicits._
> >
> >     val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
> >     val dataDF = data.toDF
> >
> >     dataDF.save("test1.parquet")
> >
> >     val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
> >
> >     dataDF2.save("test2.parquet")
> >   }
> > }
> >
> >
> > Is this related to https://issues.apache.org/jira/browse/SPARK-5532, and
> > how can it be solved?
> >
> >
> > Cheers,
> >
> >
> > Jao
>

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

Posted by Xiangrui Meng <me...@gmail.com>.
I cannot reproduce this error on master, but I'm not aware of any
recent bug fixes that are related. Could you build and try the current
master? -Xiangrui
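
If you try master, the quickest check is to re-run just the failing round
trip from your example (a sketch, reusing the names from the quoted code
below):

// This is the call that throws on 1.3.0; it works for me on master:
val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
dataDF2.save("test2.parquet")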

On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa <ja...@gmail.com> wrote:
> Hi all,
>
> A DataFrame with a user-defined type (here mllib.Vector) created with
> sqlContext.createDataFrame can't be saved to a Parquet file; the save raises
> a ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be
> cast to org.apache.spark.sql.Row.
>
> Here is an example of code to reproduce this error:
>
> object TestDataFrame {
>
>   def main(args: Array[String]): Unit = {
>     //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
>     val conf = new
> SparkConf().setAppName("RankingEval").setMaster("local[8]")
>       .set("spark.executor.memory", "6g")
>
>     val sc = new SparkContext(conf)
>     val sqlContext = new SQLContext(sc)
>
>     import sqlContext.implicits._
>
>     val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
>     val dataDF = data.toDF
>
>     dataDF.save("test1.parquet")
>
>     val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
>
>     dataDF2.save("test2.parquet")
>   }
> }
>
>
> Is this related to https://issues.apache.org/jira/browse/SPARK-5532, and how
> can it be solved?
>
>
> Cheers,
>
>
> Jao
