Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2016/02/01 18:21:38 UTC

Spark MLlib: Ideal way to convert categorical features into LabeledPoint RDD?

Hi, I have a dataset that is completely categorical; it does not contain
even one numerical column. I want to apply classification using Naive
Bayes to predict whether a given alert is actionable or not (YES/NO).
Here is an example of my dataset:

DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO

I am using Spark 1.6 and I see there is a StringIndexer class, which is used
in the OneHotEncoder example given here:
https://spark.apache.org/docs/latest/ml-features.html#onehotencoder
However, I have almost 10000 unique words/features to map to continuous
values. How do I create such a huge map? My dataset is in a CSV file. Please
guide me on how to convert all the categorical features in the CSV file and
use them in a Naive Bayes model.
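
To make the goal concrete, here is a minimal plain-Python sketch (not Spark
code) of the kind of mapping I am after. It derives the value-to-index map
from the data itself instead of writing it by hand, then one-hot encodes each
row as a list of active slot indices. The helper names (build_indexers,
encode), the shortened header names, and the YES=1.0/NO=0.0 label coding are
my own assumptions for illustration.

```python
import csv
import io

# Two sample rows copied from the message above; header names are shortened.
SAMPLE = """DayOfWeek,AlertType,Application,Router,Symptom,Action
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO
"""


def build_indexers(rows, columns):
    """Give each distinct value in each column an integer index,
    assigned in order of first appearance (hypothetical helper)."""
    indexers = {c: {} for c in columns}
    for row in rows:
        for c in columns:
            mapping = indexers[c]
            mapping.setdefault(row[c], len(mapping))
    return indexers


def encode(row, columns, indexers):
    """One-hot encode a row: return (active slot indices, total vector size).
    Each column owns a contiguous block of slots, one per distinct value."""
    active = []
    offset = 0
    for c in columns:
        active.append(offset + indexers[c][row[c]])
        offset += len(indexers[c])
    return active, offset


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
feature_cols = ["DayOfWeek", "AlertType", "Application", "Router", "Symptom"]
indexers = build_indexers(rows, feature_cols)

# Assumed label coding: YES -> 1.0, NO -> 0.0.
label_index = {"NO": 0.0, "YES": 1.0}

# Each point is (label, vector size, indices of the slots set to 1.0).
points = []
for row in rows:
    active, size = encode(row, feature_cols, indexers)
    points.append((label_index[row["Action"]], size, active))

for p in points:
    print(p)
```

Each (size, indices) pair describes a sparse vector with 1.0 in the listed
slots, which is the shape of input I would expect to wrap in a LabeledPoint.
My understanding is that StringIndexer builds this kind of value-to-index
mapping from the data when it is fitted, so the 10000-entry map would not
have to be written by hand, but I would like that confirmed.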



