
Posted to user@spark.apache.org by ashu <as...@iiitb.org> on 2014/11/02 18:20:55 UTC

RE: Prediction using Classification with text attributes in Apache Spark MLLib

Hi,
Sorry to revive this old thread.
What is the current state? Has this problem been solved? How does Spark handle
categorical data now?

Regards, 
Ashutosh



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Prediction-using-Classification-with-text-attributes-in-Apache-Spark-MLLib-tp8166p17919.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by lmk <la...@gmail.com>.

I am trying to improve on the old solution.
Do we have a better text classifier in Spark MLlib now?

Regards,
lmk




Re: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by Xiangrui Meng <me...@gmail.com>.

This operation requires two transformers:

1) Indexer, which maps string features into categorical features
2) OneHotEncoder, which flattens categorical features into binary features
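
To make the two steps concrete, here is a minimal plain-Scala sketch that does not use Spark at all. The names OneHotSketch, buildIndex and oneHot are hypothetical helpers made up for this illustration, not MLlib APIs, and assigning indices in order of first appearance is an assumption of the sketch.

```scala
// Plain-Scala sketch (no Spark). buildIndex and oneHot are hypothetical
// helpers for illustration only; they are not part of MLlib.
object OneHotSketch {
  // Step 1, "indexer": give each distinct string a category index,
  // assigned in order of first appearance.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.zipWithIndex.toMap

  // Step 2, "one-hot encoder": turn a category index into a binary
  // vector with a single 1.0 at that index.
  def oneHot(index: Int, numCategories: Int): Array[Double] = {
    val v = Array.fill(numCategories)(0.0)
    v(index) = 1.0
    v
  }

  def main(args: Array[String]): Unit = {
    val colors = Seq("red", "green", "red", "blue")
    val index = buildIndex(colors)
    colors.foreach { c =>
      val encoded = oneHot(index(c), index.size)
      println(s"$c -> ${encoded.mkString("[", ", ", "]")}")
    }
  }
}
```

With three distinct values, each string becomes a length-3 binary vector, so a dataset with many distinct strings produces very wide and very sparse features.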

We are working on the new dataset implementation, so that we can easily
express those transformations. Sorry for the late reply! If you want a quick
and dirty solution, you can try hashing:

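import scala.collection.mutable
import org.apache.spark.SparkContext._ // pair-RDD functions such as mapValues (needed before Spark 1.3)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val rdd: RDD[(Double, Array[String])] = ...
val training = rdd.mapValues { factors =>
    val indices = mutable.Set.empty[Int]
    // Hash each (value, position) pair into one of 100000 buckets.
    factors.view.zipWithIndex.foreach { case (f, idx) =>
      // Take abs after the modulo: math.abs(Int.MinValue) is still negative.
      indices += math.abs((f.## ^ idx) % 100000)
    }
    Vectors.sparse(100000, indices.toSeq.map(x => (x, 1.0)))
}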

It creates a training dataset with all binary features, with a chance
of collision. You can use it in SVM, LR, or DecisionTree.
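
To get a feel for the collision risk, here is a small plain-Scala sketch (no Spark) that counts how many distinct (value, position) pairs end up sharing a bucket. HashingSketch, bucket and collisions are hypothetical names introduced for this illustration; the bucket rule follows the spirit of the hashing snippet above, with the absolute value taken after the modulo so the result is never negative.

```scala
// Plain-Scala sketch (no Spark). All names here are hypothetical and
// exist only to illustrate hash collisions.
object HashingSketch {
  // Map a (value, position) pair to a bucket in [0, numBuckets).
  def bucket(value: String, position: Int, numBuckets: Int): Int =
    math.abs((value.## ^ position) % numBuckets)

  // Number of distinct pairs that land in a bucket already used by
  // another distinct pair.
  def collisions(pairs: Seq[(String, Int)], numBuckets: Int): Int = {
    val distinctPairs = pairs.distinct
    val usedBuckets =
      distinctPairs.map { case (v, i) => bucket(v, i, numBuckets) }.distinct
    distinctPairs.size - usedBuckets.size
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq("red", "green", "blue", "large", "small").zipWithIndex
    // Fewer buckets means less memory but more collisions.
    Seq(2, 16, 100000).foreach { n =>
      println(s"numBuckets=$n collisions=${collisions(pairs, n)}")
    }
  }
}
```

Whenever the number of buckets is smaller than the number of distinct pairs, some collisions are unavoidable; a larger bucket count lowers the chance at the cost of a wider sparse vector.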

Best,
Xiangrui

