Posted to user@spark.apache.org by saatvikshah1994 <sa...@gmail.com> on 2017/06/15 14:19:37 UTC

Best alternative for Category Type in Spark Dataframe

Hi, 
I'm trying to convert a Pandas DataFrame to a Spark DataFrame. One of my
columns is of the Category type in Pandas, but there does not seem to be
support for an equivalent type in Spark. What is the best alternative?





Re: Best alternative for Category Type in Spark Dataframe

Posted by Saatvik Shah <sa...@gmail.com>.
Thanks guys,

You've all given a number of options to work with.

The thing is that I'm working in a production environment, where it might be
necessary to ensure that no one erroneously inserts new records into those
specific columns which should be of the Category data type. The best
alternative there would be a Category-like dataframe column datatype,
without the additional overhead of running a transformer. Is that possible?

Thanks and Regards,
Saatvik




Re: Best alternative for Category Type in Spark Dataframe

Posted by Pralabh Kumar <pr...@gmail.com>.
Makes sense :)


Re: Best alternative for Category Type in Spark Dataframe

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
Yes, perhaps we could use SQLTransformer as well.

http://spark.apache.org/docs/latest/ml-features.html#sqltransformer
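For instance, a minimal sketch (assuming a DataFrame df with the EMOTION
column discussed elsewhere in this thread; SQLTransformer substitutes the
input dataset for the __THIS__ placeholder):

import org.apache.spark.ml.feature.SQLTransformer

// Keep only the rows whose EMOTION value is one of the allowed categories.
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")

sqlTrans.transform(df).show()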


Re: Best alternative for Category Type in Spark Dataframe

Posted by Pralabh Kumar <pr...@gmail.com>.
Hi Yan

Yes, SQL is a good option, but if we have to create an ML Pipeline, then
having Transformers and setting them as pipeline stages would be the better
option.
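For example (a sketch; the names sqlTrans and indexer stand in for
transformers like the ones discussed in this thread, and df is the input
DataFrame):

import org.apache.spark.ml.Pipeline

// Chain the category filter and the indexer into one reusable pipeline.
val pipeline = new Pipeline().setStages(Array(sqlTrans, indexer))
val model = pipeline.fit(df)
model.transform(df).show()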

Regards
Pralabh Kumar


Re: Best alternative for Category Type in Spark Dataframe

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
To filter data, how about using sql?

df.createOrReplaceTempView("df")
// The categories must be quoted string literals, or Spark will parse them as column names.
val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")

https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
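The same filter also works with the DataFrame API directly (a sketch,
assuming spark.implicits._ is in scope and df has a string column EMOTION):

import spark.implicits._

val filtered = df.filter($"EMOTION".isin("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA"))
filtered.show()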




Re: Best alternative for Category Type in Spark Dataframe

Posted by Pralabh Kumar <pr...@gmail.com>.
Hi Saatvik

You can write your own Transformer to make sure that the column contains
only the values you provide, and filter out the rows which don't.

Something like this


import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

case class CategoryTransformer(override val uid: String) extends Transformer {
  // Keep only the rows whose col1 value is an allowed category.
  override def transform(inputData: Dataset[_]): DataFrame =
    inputData.select("col1").filter("col1 in ('happy')")

  // defaultCopy re-creates this instance with the same uid (instead of ???, which throws).
  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)

  // Filtering rows leaves the schema unchanged.
  override def transformSchema(schema: StructType): StructType = schema
}


Usage

// spark and sc as predefined in spark-shell; implicits are needed for toDF.
import spark.implicits._

val data = sc.parallelize(List("abce", "happy")).toDF("col1")
val trans = new CategoryTransformer("1")
data.show()                  // both rows: abce, happy
trans.transform(data).show() // only the allowed value: happy


This transformer will make sure that col1 only ever contains the values you
provided.
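Note that filtering silently drops the bad rows. If a bad insert should fail
the job instead, one variant (a sketch, with a hypothetical allowed-value
set) is a UDF that throws on unexpected values:

import org.apache.spark.sql.functions.udf

val allowed = Set("happy") // hypothetical allowed-value set
val check = udf { (v: String) =>
  // Fail the task loudly instead of dropping the row.
  require(allowed.contains(v), s"unexpected category: $v")
  v
}
data.withColumn("col1", check(data("col1"))).show()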


Regards
Pralabh Kumar


Re: Best alternative for Category Type in Spark Dataframe

Posted by Saatvik Shah <sa...@gmail.com>.
Hi Pralabh,

I want the ability to create a column such that its values are restricted to
a specific set of predefined values.
For example, suppose I have a column called EMOTION: I want to ensure each
row value is one of HAPPY, SAD, ANGRY, NEUTRAL, NA.

Thanks and Regards,
Saatvik Shah


Re: Best alternative for Category Type in Spark Dataframe

Posted by Pralabh Kumar <pr...@gmail.com>.
Hi Saatvik,

Can you please provide an example of what exactly you want?

Re: Best alternative for Category Type in Spark Dataframe

Posted by Saatvik Shah <sa...@gmail.com>.
Hi Yan,

Basically, the reason I was looking for the categorical datatype is as given
here <https://pandas.pydata.org/pandas-docs/stable/categorical.html>: the
ability to fix column values to specific categories. Is it possible to
create a user-defined data type which could do so?

Thanks and Regards,
Saatvik Shah


Re: Best alternative for Category Type in Spark Dataframe

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
You can use some Transformers to handle categorical data.
For example,
StringIndexer encodes a string column of labels to a column of label
indices:
http://spark.apache.org/docs/latest/ml-features.html#stringindexer
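For example, a minimal sketch (assuming Spark 2.x and a DataFrame df with a
string column EMOTION; the output column name is arbitrary):

import org.apache.spark.ml.feature.StringIndexer

// Map each distinct EMOTION string to a numeric index (0.0, 1.0, ...).
val indexer = new StringIndexer()
  .setInputCol("EMOTION")
  .setOutputCol("EMOTION_IDX")

indexer.fit(df).transform(df).show()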

