Posted to issues@spark.apache.org by "mahendra singh (JIRA)" <ji...@apache.org> on 2016/06/30 05:36:10 UTC

[jira] [Commented] (SPARK-16290) text type features column for classification

    [ https://issues.apache.org/jira/browse/SPARK-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356548#comment-15356548 ] 

mahendra singh commented on SPARK-16290:
----------------------------------------

[~srowen] Hi srowen,
I have an issue with Spark regarding text-type features for naive Bayes.
I have the following data:

Male, Suspicion of Alcohol, Weekday, 12, 75, 30-39
Male, Moving Traffic Violation, Weekday, 12, 20, 20-24
Male, Suspicion of Alcohol, Weekend, 4, 1 2, 40-49
Male, Suspicion of Alcohol, Weekday, 12, 0, 50-59
Female, Road Traffic Collision, Weekend, 12, 0, 20-24
Male, Road Traffic Collision, Weekday, 12, 0, 25-29
Male, Road Traffic Collision, Weekday, 8, 0, Other
Male, Road Traffic Collision, Weekday, 8, 23, 60-69
Male, Moving Traffic Violation, Weekend, 4, 26, 30-39
Female, Road Traffic Collision, Weekend, 8, 61, 16-19
Male, Moving Traffic Violation, Weekend, 4, 74, 25-29
Male, Road Traffic Collision, Weekday, 12, 0, Other
Male, Moving Traffic Violation, Weekday, 8, 0, 16-19
Male, Road Traffic Collision, Weekday, 8, 0, Other
Male, Moving Traffic Violation, Weekend, 4, 0, 30-39

In this data, some of the comma-separated columns are numeric and some are text. Spark's naive Bayes currently supports only numeric data, so how can I transform the text columns into numeric ones? The numeric value assigned to each text value must be the same every time (in both training and testing); otherwise it will cause problems.
Is this possible in Spark now? I am asking because I did not find a solution. If it is possible, how is it done? If not, how can this issue be solved?
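
To make the requirement concrete, here is a minimal plain-Python sketch of the "same number every time" idea: learn the text-to-number mapping once from the training column, then reuse that same mapping for the test column. The function names (fit_category_index, encode) are hypothetical and are not Spark API; in spark.ml the StringIndexer feature transformer follows the same fit-once, reuse-the-fitted-model pattern, though its index ordering differs from this sketch. Sending unseen values to one extra index is an assumption made here for illustration.

```python
from typing import Dict, List, Sequence


def fit_category_index(values: Sequence[str]) -> Dict[str, int]:
    """Learn a text -> integer mapping from the training column.

    Values are stripped of surrounding spaces and sorted, so the same
    training data always produces the same mapping.
    """
    labels = sorted({v.strip() for v in values})
    return {label: idx for idx, label in enumerate(labels)}


def encode(values: Sequence[str], mapping: Dict[str, int]) -> List[int]:
    """Encode a column with an already-fitted mapping.

    Assumption: values not seen during training all map to one extra
    index, len(mapping), so every output stays non-negative.
    """
    unseen = len(mapping)
    return [mapping.get(v.strip(), unseen) for v in values]


# Training column (values taken from the sample data above).
train_reason = [
    "Suspicion of Alcohol",
    "Moving Traffic Violation ",
    "Road Traffic Collision",
    "Suspicion of Alcohol",
]
reason_index = fit_category_index(train_reason)

# Test column: reuse the mapping fitted on the training column.
# "Speeding" is a made-up value that does not occur in training.
test_reason = ["Road Traffic Collision", "Suspicion of Alcohol", "Speeding"]
encoded_test = encode(test_reason, reason_index)
print(reason_index)
print(encoded_test)
```

The important design point is that the mapping is fitted only on training data and then stored and reused, never refitted on the test data.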

> text type features column for classification
> --------------------------------------------
>
>                 Key: SPARK-16290
>                 URL: https://issues.apache.org/jira/browse/SPARK-16290
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>    Affects Versions: 1.6.2
>            Reporter: mahendra singh
>              Labels: features
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> We need to improve how Spark ML and MLlib handle feature columns, so that text values can also be given as features.
> Suppose we have 4 feature values:
> id, dept_name, score, result
> Here dept_name is a text column, so Spark would have to handle it internally, that is, convert the text column to a numeric column.


