Posted to user@spark.apache.org by "颜发才 (Yan Facai)" <ya...@gmail.com> on 2016/11/17 04:08:26 UTC

Best practice for preprocessing feature with DataFrame

Hi,
I have a sample like this:
+---+------+--------------------+
|age|gender|             city_id|
+---+------+--------------------+
|   |     1|1042015:city_2044...|
|90s|     2|1042015:city_2035...|
|80s|     2|1042015:city_2061...|
+---+------+--------------------+

and the expectation is:
"age":  90s -> 90, 80s -> 80
"gender": 1 -> "male", 2 -> "female"

I have two solutions:
1. Handle each column separately, then join the results by index.
    val age = input.select("age").map(...)
    val gender = input.select("gender").map(...)
    val result = ...

2. Write a UDF for each column, then use them together:
     val result = input.select(ageUDF($"age"), genderUDF($"gender"))

However, both feel awkward.

Does anyone have a better workflow?
Should I write some custom Transformers and use a Pipeline?

Thanks.
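As a side note on solution 2: the per-column mappings can be written as plain, Spark-free Scala functions and only wrapped into UDFs at the edge, which keeps the logic unit-testable. A minimal sketch (the names `parseAge` and `genderLabel` are hypothetical, not from the thread):

```scala
// Pure mapping functions; no Spark dependency, easy to unit test.
def parseAge(s: String): Option[Int] =
  if (s != null && s.matches("\\d+s")) Some(s.dropRight(1).toInt)
  else None

def genderLabel(code: Int): String = code match {
  case 1 => "male"
  case 2 => "female"
  case _ => "unknown"
}

// In Spark these would be wrapped as UDFs, roughly:
//   import org.apache.spark.sql.functions.udf
//   val ageUDF    = udf(parseAge _)
//   val genderUDF = udf(genderLabel _)
//   input.select(ageUDF($"age") as "age", genderUDF($"gender") as "gender")
```

Keeping the functions pure also makes the empty-age row explicit: it maps to `None` instead of silently producing a bad value.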

Re: Best practice for preprocessing feature with DataFrame

Posted by "颜发才 (Yan Facai)" <ya...@gmail.com>.
Thanks, White.


Re: Best practice for preprocessing feature with DataFrame

Posted by Stuart White <st...@gmail.com>.
Sorry.  Small typo.  That last part should be:

val modifiedRows = rows
  .select(
    substring('age, 0, 2) as "age",
    when('gender === 1, "male")
      .otherwise(when('gender === 2, "female").otherwise("unknown")) as "gender"
  )
modifiedRows.show

+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+
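The original question also asked about custom Transformers in a Pipeline. A Spark-free sketch of that composition idea: each preprocessing step is a function from record to record, and a pipeline is just their composition (the `Record` type and step names here are illustrative, not from the thread):

```scala
// Spark-free sketch of the "pipeline of transforms" idea:
// each step maps a record to a record; a pipeline composes steps.
case class Record(age: String, gender: String)

val fixAge: Record => Record =
  r => r.copy(age = if (r.age.matches("\\d+s")) r.age.dropRight(1) else r.age)

val fixGender: Record => Record =
  r => r.copy(gender = r.gender match {
    case "1" => "male"
    case "2" => "female"
    case _   => "unknown"
  })

// Composition: run fixAge, then fixGender.
val pipeline: Record => Record = fixAge andThen fixGender

// In Spark ML the same shape would be a Pipeline of custom Transformers,
// each overriding transform(dataset) to apply one column rewrite.
```

Each step stays small and independently testable, and adding a new column rewrite is just one more function in the chain.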


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Best practice for preprocessing feature with DataFrame

Posted by Stuart White <st...@gmail.com>.
import org.apache.spark.sql.functions._

val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
rows.show

+---+------+
|age|gender|
+---+------+
|90s|     1|
|80s|     2|
|80s|     3|
+---+------+

val modifiedRows
  .select(
    substring('age, 0, 2) as "age",
    when('gender === 1, "male").otherwise(when('gender === 2,
"female").otherwise("unknown")) as "gender"
  )
modifiedRows.show

+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+




Re: Best practice for preprocessing feature with DataFrame

Posted by "颜发才 (Yan Facai)" <ya...@gmail.com>.
Could you give me an example of how to use the Column functions?
Thanks very much.


Re: Best practice for preprocessing feature with DataFrame

Posted by Divya Gehlot <di...@gmail.com>.
Hi,

You can use the Column functions provided by the Spark API:

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html

Hope this helps.

Thanks,
Divya

