Posted to user@spark.apache.org by "颜发才 (Yan Facai)" <ya...@gmail.com> on 2016/11/17 04:08:26 UTC
Best practice for preprocessing feature with DataFrame
Hi,
I have a sample, like:
+---+------+--------------------+
|age|gender|             city_id|
+---+------+--------------------+
|   |     1|1042015:city_2044...|
|90s|     2|1042015:city_2035...|
|80s|     2|1042015:city_2061...|
+---+------+--------------------+
and the expected transformation is:
"age": 90s -> 90, 80s -> 80
"gender": 1 -> "male", 2 -> "female"
I have two solutions:
1. Handle each column separately, and then join all by index.
val age = input.select("age").map(...)
val gender = input.select("gender").map(...)
val result = ...
2. Write a UDF for each column, and then use them all together:
val result = input.select(ageUDF($"age"), genderUDF($"gender"))
However, both feel awkward.
Does anyone have a better workflow?
Should I write some custom Transformers and use a Pipeline?
Thanks.
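The per-column mappings described above can be sketched independently of Spark (a minimal sketch; FeatureMaps, parseAge, and genderLabel are hypothetical names, and in practice these functions would be wrapped in UDFs or inside a custom Transformer):

```scala
// Hypothetical helpers sketching the pure mapping logic for each column.
object FeatureMaps {
  // "90s" -> Some(90), "80s" -> Some(80); blank or malformed input -> None
  def parseAge(s: String): Option[Int] =
    "^(\\d+)s$".r.findFirstMatchIn(s.trim).map(_.group(1).toInt)

  // 1 -> "male", 2 -> "female", any other code -> "unknown"
  def genderLabel(code: Int): String = code match {
    case 1 => "male"
    case 2 => "female"
    case _ => "unknown"
  }
}
```

Wrapped with org.apache.spark.sql.functions.udf, these could be applied in a single select, or called from a custom Transformer's transform method for use in an ML Pipeline.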
Re: Best practice for preprocessing feature with DataFrame
Posted by "颜发才 (Yan Facai)" <ya...@gmail.com>.
Thanks, Stuart.
On Thu, Nov 17, 2016 at 11:15 PM, Stuart White <st...@gmail.com>
wrote:
> Sorry. Small typo. That last part should be:
>
> val modifiedRows = rows
> .select(
> substring('age, 0, 2) as "age",
> when('gender === 1, "male").otherwise(when('gender === 2,
> "female").otherwise("unknown")) as "gender"
> )
> modifiedRows.show
>
> +---+-------+
> |age| gender|
> +---+-------+
> | 90|   male|
> | 80| female|
> | 80|unknown|
> +---+-------+
>
Re: Best practice for preprocessing feature with DataFrame
Posted by Stuart White <st...@gmail.com>.
Sorry. Small typo. That last part should be:
val modifiedRows = rows
.select(
substring('age, 0, 2) as "age",
when('gender === 1, "male").otherwise(when('gender === 2,
"female").otherwise("unknown")) as "gender"
)
modifiedRows.show
+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+
On Thu, Nov 17, 2016 at 8:57 AM, Stuart White <st...@gmail.com> wrote:
> import org.apache.spark.sql.functions._
>
> val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
> rows.show
>
> +---+------+
> |age|gender|
> +---+------+
> |90s|     1|
> |80s|     2|
> |80s|     3|
> +---+------+
>
> val modifiedRows
> .select(
> substring('age, 0, 2) as "age",
> when('gender === 1, "male").otherwise(when('gender === 2,
> "female").otherwise("unknown")) as "gender"
> )
> modifiedRows.show
>
> +---+-------+
> |age| gender|
> +---+-------+
> | 90|   male|
> | 80| female|
> | 80|unknown|
> +---+-------+
>
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: Best practice for preprocessing feature with DataFrame
Posted by Stuart White <st...@gmail.com>.
import org.apache.spark.sql.functions._
val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
rows.show
+---+------+
|age|gender|
+---+------+
|90s|     1|
|80s|     2|
|80s|     3|
+---+------+
val modifiedRows
.select(
substring('age, 0, 2) as "age",
when('gender === 1, "male").otherwise(when('gender === 2,
"female").otherwise("unknown")) as "gender"
)
modifiedRows.show
+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+
On Thu, Nov 17, 2016 at 3:37 AM, 颜发才(Yan Facai) <ya...@gmail.com> wrote:
> Could you give me an example, how to use Column function?
> Thanks very much.
>
Re: Best practice for preprocessing feature with DataFrame
Posted by "颜发才 (Yan Facai)" <ya...@gmail.com>.
Could you give me an example, how to use Column function?
Thanks very much.
On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot <di...@gmail.com>
wrote:
> Hi,
>
> You can use the Column functions provided by Spark API
>
> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
>
> Hope this helps.
>
> Thanks,
> Divya
>
>
Re: Best practice for preprocessing feature with DataFrame
Posted by Divya Gehlot <di...@gmail.com>.
Hi,
You can use the Column functions provided by Spark API
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
Hope this helps.
Thanks,
Divya
On 17 November 2016 at 12:08, 颜发才(Yan Facai) <ya...@gmail.com> wrote:
> Hi,
> I have a sample, like:
> +---+------+--------------------+
> |age|gender|             city_id|
> +---+------+--------------------+
> |   |     1|1042015:city_2044...|
> |90s|     2|1042015:city_2035...|
> |80s|     2|1042015:city_2061...|
> +---+------+--------------------+
>
> and the expected transformation is:
> "age": 90s -> 90, 80s -> 80
> "gender": 1 -> "male", 2 -> "female"
>
> I have two solutions:
> 1. Handle each column separately, and then join all by index.
> val age = input.select("age").map(...)
> val gender = input.select("gender").map(...)
> val result = ...
>
> 2. Write a UDF for each column, and then use them all together:
> val result = input.select(ageUDF($"age"), genderUDF($"gender"))
>
> However, both feel awkward.
>
> Does anyone have a better workflow?
> Should I write some custom Transformers and use a Pipeline?
>
> Thanks.