Posted to issues@spark.apache.org by "mahendra singh (JIRA)" <ji...@apache.org> on 2016/06/30 05:36:10 UTC
[jira] [Commented] (SPARK-16290) text type features column for classification
[ https://issues.apache.org/jira/browse/SPARK-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356548#comment-15356548 ]
mahendra singh commented on SPARK-16290:
----------------------------------------
[~srowen] Hi srowen,
I have an issue with Spark regarding text-type features for naive Bayes.
I have the following data:
Male , Suspicion of Alcohol , Weekday , 12 ,75 , 30-39
Male , Moving Traffic Violation , Weekday , 12 , 20 ,20-24
Male , Suspicion of Alcohol , Weekend , 4 , 1 2, 40-49
Male , Suspicion of Alcohol , Weekday , 12 , 0 , 50-59
Female , Road Traffic Collision , Weekend , 12 , 0 , 20-24
Male , Road Traffic Collision , Weekday , 12 , 0 , 25-29
Male , Road Traffic Collision , Weekday , 8 , 0 , Other
Male , Road Traffic Collision , Weekday , 8 , 23 , 60-69
Male , Moving Traffic Violation , Weekend , 4, 26, 30-39
Female , Road Traffic Collision , Weekend, 8 , 61, 16-19
Male , Moving Traffic Violation , Weekend , 4 , 74 , 25-29
Male , Road Traffic Collision , Weekday , 12, 0 , Other
Male , Moving Traffic Violation , Weekday , 8 , 0 , 16-19
Male , Road Traffic Collision , Weekday , 8 , 0 , Other
Male , Moving Traffic Violation , Weekend , 4 , 0 ,30-39
In this data, some of the comma-separated columns are numeric and some are text. Spark's naive Bayes currently supports only numeric features, so how can I transform the text columns into numeric ones? The numeric value assigned to each text value must be the same every time (in both training and testing); otherwise it will cause problems.
Is this possible in Spark now? I am asking because I did not find a solution for it. If it is possible, how is it done, and if not, can this be solved?
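To make the requirement concrete, here is a minimal sketch in plain Python (not Spark code) of what I mean by a consistent mapping: the text-to-number index is learned once from the training rows and the same index is reused for the test rows. The column meanings (gender, reason, day type), the helper names parse, fit_index, fit and encode, and the choice to send unseen values to one extra index are all my own assumptions for illustration. Spark ML does ship a StringIndexer transformer that follows the same fit-then-transform pattern, but I have not verified that it covers this case.

```python
from collections import Counter

# Columns assumed to hold text categories (gender, reason, day type).
# The last column is treated as the label; the rest are numeric.
CATEGORICAL_COLUMNS = (0, 1, 2)


def parse(line):
    """Split one comma-separated row and trim the stray spaces."""
    return [field.strip() for field in line.split(",")]


def fit_index(values):
    """Give each distinct value an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def fit(rows):
    """Learn one value-to-index mapping per categorical column from training rows."""
    return {c: fit_index(row[c] for row in rows) for c in CATEGORICAL_COLUMNS}


def encode(row, indexes):
    """Turn a parsed row into (numeric features, label) using fitted mappings."""
    features = []
    for c, field in enumerate(row[:-1]):
        if c in indexes:
            mapping = indexes[c]
            # Values never seen in training share one extra index (an assumption).
            features.append(float(mapping.get(field, len(mapping))))
        else:
            features.append(float(field))
    return features, row[-1]


train = [parse(line) for line in (
    "Male , Suspicion of Alcohol , Weekday , 12 ,75 , 30-39",
    "Male , Moving Traffic Violation , Weekday , 12 , 20 ,20-24",
    "Female , Road Traffic Collision , Weekend , 12 , 0 , 20-24",
)]
indexes = fit(train)  # learned once, then reused unchanged for test data

test_row = parse("Male , Road Traffic Collision , Weekday , 8 , 23 , 60-69")
print(encode(test_row, indexes))  # ([0.0, 1.0, 0.0, 8.0, 23.0], '60-69')
```

Because the mappings are built only from the training rows and stored, "Male" or "Weekday" gets the same number at training and at test time. Whether a bare index is a good input for naive Bayes, or whether each category should become its own 0/1 column, is a separate question that this sketch does not settle.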
> text type features column for classification
> --------------------------------------------
>
> Key: SPARK-16290
> URL: https://issues.apache.org/jira/browse/SPARK-16290
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Affects Versions: 1.6.2
> Reporter: mahendra singh
> Labels: features
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> We need to improve how Spark ML and MLlib handle feature columns, meaning that text values should also be accepted as features.
> Suppose we have 4 feature values:
> id, dept_name, score, result
> Here dept_name is a text column, so Spark has to handle it internally, meaning it has to convert the text column to a numeric one.