You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by "颜发才 (Yan Facai)" <ya...@gmail.com> on 2016/10/02 11:41:55 UTC

Re: Dataframe, Java: How to convert String to Vector ?

Hi, Perter.

It's interesting that `DecisionTreeRegressor.transformImpl` also use udf to
transform dataframe, instead of using map:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L175


On Wed, Sep 21, 2016 at 10:22 PM, Peter Figliozzi <pe...@gmail.com>
wrote:

> I'm sure there's another way to do it; I hope someone can show us.  I
> couldn't figure out how to use `map` either.
>
> On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai) <ya...@gmail.com> wrote:
>
>> Thanks, Peter.
>> It works!
>>
>> Why udf is needed?
>>
>>
>>
>>
>> On Wed, Sep 21, 2016 at 12:00 AM, Peter Figliozzi <
>> pete.figliozzi@gmail.com> wrote:
>>
>>> Hi Yan, I agree, it IS really confusing.  Here is the technique for
>>> transforming a column.  It is very general because you can make "myConvert"
>>> do whatever you want.
>>>
>>> import org.apache.spark.mllib.linalg.Vectors
>>> val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>>>
>>> df.show()
>>> // The columns were named "_1" and "_2"
>>> // Very confusing, because it looks like a Scala wildcard when we refer
>>> to it in code
>>>
>>> val myConvert = (x: String) => { Vectors.parse(x) }
>>> val myConvertUDF = udf(myConvert)
>>>
>>> val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))
>>>
>>> newDf.show()
>>>
>>> On Mon, Sep 19, 2016 at 3:29 AM, 颜发才(Yan Facai) <ya...@gmail.com>
>>> wrote:
>>>
>>>> Hi, all.
>>>> I find that it's really confuse.
>>>>
>>>> I can use Vectors.parse to create a DataFrame contains Vector type.
>>>>
>>>>     scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1,
>>>> Vectors.parse("[2,4,6]"))).toDF
>>>>     dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]
>>>>
>>>>
>>>> But using map to convert String to Vector throws an error:
>>>>
>>>>     scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>>>>     dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>>>
>>>>     scala> dataStr.map(row => Vectors.parse(row.getString(1)))
>>>>     <console>:30: error: Unable to find encoder for type stored in a
>>>> Dataset.  Primitive types (Int, String, etc) and Product types (case
>>>> classes) are supported by importing spark.implicits._  Support for
>>>> serializing other types will be added in future releases.
>>>>       dataStr.map(row => Vectors.parse(row.getString(1)))
>>>>
>>>>
>>>> Dose anyone can help me,
>>>> thanks very much!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi <
>>>> pete.figliozzi@gmail.com> wrote:
>>>>
>>>>> Hi Yan, I think you'll have to map the features column to a new
>>>>> numerical features column.
>>>>>
>>>>> Here's one way to do the individual transform:
>>>>>
>>>>> scala> val x = "[1, 2, 3, 4, 5]"
>>>>> x: String = [1, 2, 3, 4, 5]
>>>>>
>>>>> scala> val y:Array[Int] = x slice(1, x.length - 1) replace(",", "")
>>>>> split(" ") map(_.toInt)
>>>>> y: Array[Int] = Array(1, 2, 3, 4, 5)
>>>>>
>>>>> If you don't know about the Scala command line, just type "scala" in a
>>>>> terminal window.  It's a good place to try things out.
>>>>>
>>>>> You can make a function out of this transformation and apply it to
>>>>> your features column to make a new column.  Then add this with
>>>>> Dataset.withColumn.
>>>>>
>>>>> See here
>>>>> <http://stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column>
>>>>> on how to apply a function to a Column to make a new column.
>>>>>
>>>>> On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) <ya...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I have a csv file like:
>>>>>> uid      mid      features       label
>>>>>> 123    5231    [0, 1, 3, ...]    True
>>>>>>
>>>>>> Both  "features" and "label" columns are used for GBTClassifier.
>>>>>>
>>>>>> However, when I read the file:
>>>>>> Dataset<Row> samples = sparkSession.read().csv(file);
>>>>>> The type of samples.select("features") is String.
>>>>>>
>>>>>> My question is:
>>>>>> How to map samples.select("features") to Vector or any appropriate
>>>>>> type,
>>>>>> so I can use it to train like:
>>>>>>         GBTClassifier gbdt = new GBTClassifier()
>>>>>>                 .setLabelCol("label")
>>>>>>                 .setFeaturesCol("features")
>>>>>>                 .setMaxIter(2)
>>>>>>                 .setMaxDepth(7);
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>