Posted to user@spark.apache.org by Dan Bikle <bi...@gmail.com> on 2016/09/25 11:57:19 UTC

In Spark-Scala, how to copy Array of Lists into new DataFrame?

Hello World,

I am familiar with Python and I am learning Spark-Scala.

I want to build a DataFrame whose structure is described by this syntax:

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.1, Vectors.dense(1.1, 0.1)),
  (0.2, Vectors.dense(1.0, -1.0)),
  (3.0, Vectors.dense(1.3, 1.0)),
  (1.0, Vectors.dense(1.2, -0.5))
)).toDF("label", "features")
I got the above syntax from this URL:

http://spark.apache.org/docs/latest/ml-pipeline.html

Currently my data is in an array which I pulled out of a DF:

val my_a = gspc17_df.collect().map { row =>
  Seq(row(2), Vectors.dense(row(3).asInstanceOf[Double], row(4).asInstanceOf[Double]))
}
The structure of my array is very similar to the above DF:

my_a: Array[Seq[Any]] =
Array(
  List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
  List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
  List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
How do I copy the data from my array into a DataFrame with the above
structure?

I tried this syntax:

val my_df = spark.createDataFrame(my_a).toDF("label","features")
Spark barked at me:

<console>:105: error: inferred type arguments [Seq[Any]] do not conform to
method createDataFrame's type parameter bounds [A <: Product]
       val my_df = spark.createDataFrame(my_a).toDF("label","features")
                         ^
<console>:105: error: type mismatch;
 found   : scala.collection.mutable.WrappedArray[Seq[Any]]
 required: Seq[A]
       val my_df = spark.createDataFrame(my_a).toDF("label","features")
                         ^

scala>

Re: In Spark-Scala, how to copy Array of Lists into new DataFrame?

Posted by Marco Mistroni <mm...@gmail.com>.
Hi,

In fact I have just found some written notes in my code; see if this doc
helps you (it will work with any Spark version, not only 1.3.0):

https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#creating-dataframes
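
For example, here is a minimal sketch of the reflection-based route from that
guide (the case class name and the sample values here are made up for
illustration):

import org.apache.spark.sql.SQLContext

// a case class gives Spark the schema via reflection
case class Record(label: Double, f1: Double, f2: Double)

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
import sqlContext.implicits._         // brings rdd.toDF() into scope

val df = sc.parallelize(Seq(
  Record(1.1, 0.1, 0.2),
  Record(0.2, 1.0, -1.0)
)).toDF()

df.printSchema()
df.show()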
hth


On Sun, Sep 25, 2016 at 1:25 PM, Marco Mistroni <mm...@gmail.com> wrote:

> [snip]

Re: In Spark-Scala, how to copy Array of Lists into new DataFrame?

Posted by Marco Mistroni <mm...@gmail.com>.
Hi,

I must admit I had issues as well in finding a sample that does that
(hopefully the Spark folks can add more examples, or someone on the list can
post sample code?).

Hopefully you can reuse the sample below. You start from an RDD of
sequences of doubles (myRdd):

// make a Row out of each sequence of doubles
val toRddOfRows = myRdd.map(doubleValues => Row.fromSeq(doubleValues))

Then you can either call toDF() directly and Spark will build the schema for
you. Beware that you will need import sqlContext.implicits._ first, and that
this route wants an RDD of tuples or case classes (Products) rather than an
RDD of Rows:

val df = myTupleRdd.toDF()  // myTupleRdd: a hypothetical RDD of tuples

Or you can create a schema yourself:

def createSchema(row: Row) = {
  val first = row.toSeq
  val firstWithIdx = first.zipWithIndex
  val fields = firstWithIdx.map(tpl =>
    StructField("Col" + tpl._2, DoubleType, nullable = false))
  StructType(fields)
}

val mySchema = createSchema(toRddOfRows.first())

// returning a DataFrame
val mydf = sqlContext.createDataFrame(toRddOfRows, mySchema)
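
Putting it together, a minimal end-to-end sketch; it assumes sc is an
existing SparkContext and that each element of myRdd is a Seq[Double] (the
sample values are made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val myRdd: RDD[Seq[Double]] = sc.parallelize(Seq(Seq(1.0, 2.0), Seq(3.0, 4.0)))

// one Row per sequence of doubles
val toRddOfRows = myRdd.map(doubleValues => Row.fromSeq(doubleValues))

// one non-nullable DoubleType column per position: Col0, Col1, ...
val mySchema = StructType(toRddOfRows.first().toSeq.zipWithIndex.map {
  case (_, idx) => StructField("Col" + idx, DoubleType, nullable = false)
})

val sqlContext = new SQLContext(sc)
val mydf = sqlContext.createDataFrame(toRddOfRows, mySchema)
mydf.show()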


hth

You need to define a schema to make a DF out of your list; check the Spark
docs on how to make a DF, or some of the machine learning examples.
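
Re your specific error: createDataFrame wants a type A <: Product, and
Seq[Any] is not a Product, but a tuple is. So something along these lines
should work; this is a sketch that assumes Spark 2.x (your spark session)
and that columns 2, 3 and 4 of gspc17_df hold doubles:

import org.apache.spark.ml.linalg.Vectors

// build (label, features) tuples instead of Seq[Any]
val my_a = gspc17_df.collect().map { row =>
  (row.getDouble(2), Vectors.dense(row.getDouble(3), row.getDouble(4)))
}

val my_df = spark.createDataFrame(my_a.toSeq).toDF("label", "features")

Note this still collect()s everything to the driver first; for a big
DataFrame you may prefer to build the tuples without the collect().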

On 25 Sep 2016 12:57 pm, "Dan Bikle" <bi...@gmail.com> wrote:

> [snip]