
Posted to user@spark.apache.org by lmk <la...@gmail.com> on 2014/06/24 13:17:08 UTC

Prediction using Classification with text attributes in Apache Spark MLLib

Hi,
I am trying to predict an attribute with a binary value (Yes/No) using SVM.
All the attributes in my training set are text attributes.
I understand that I have to convert my outcome to a double (0.0/1.0), but I
do not understand how to deal with my explanatory variables, which are also
text.
Please let me know how I can do this.

Thanks.





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Prediction-using-Classification-with-text-attributes-in-Apache-Spark-MLLib-tp8166.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by Sean Owen <so...@cloudera.com>.

On Tue, Jun 24, 2014 at 12:28 PM, Ulanov, Alexander
<al...@hp.com> wrote:
> You need to convert your text to vector space model: http://en.wikipedia.org/wiki/Vector_space_model
> and then pass it to SVM. As far as I know, in previous versions of MLlib there was a special class for doing this: https://github.com/amplab/MLI/blob/master/src/main/scala/feat/NGrams.scala. It is not compatible with Spark 1.0.
> I wonder why MLLib folks didn't include it in newer versions of Spark.

(PS that is a class from MLI, not MLlib)

Re: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by lmk <la...@gmail.com>.

Trying to improve the old solution. 
Do we have a better text classifier now in Spark Mllib?

Regards,
lmk




---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by Xiangrui Meng <me...@gmail.com>.

This operation requires two transformers:

1) Indexer, which maps string features into categorical features
2) OneHotEncoder, which flattens categorical features into binary features
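
To make the two steps concrete, here is a minimal sketch in plain Scala with no Spark dependency. The object and method names (CategoricalEncoding, buildIndex, oneHot, encodeColumn) are hypothetical and only illustrate the idea; they are not the MLlib transformers mentioned above.

```scala
object CategoricalEncoding {
  // Step 1 (indexer): assign each distinct string an integer index,
  // in order of first appearance.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.zipWithIndex.toMap

  // Step 2 (one-hot): turn a category index into a binary vector
  // that has a single 1.0 at that index.
  def oneHot(index: Int, size: Int): Array[Double] = {
    val v = Array.fill(size)(0.0)
    v(index) = 1.0
    v
  }

  // Encode one column of string values using both steps.
  def encodeColumn(values: Seq[String]): Seq[Array[Double]] = {
    val index = buildIndex(values)
    values.map(s => oneHot(index(s), index.size))
  }
}
```

For example, CategoricalEncoding.encodeColumn(Seq("red", "green", "red")) yields the vectors (1.0, 0.0), (0.0, 1.0) and (1.0, 0.0). The dense arrays are only for readability; with many distinct values a sparse representation would be the more natural choice.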

We are working on the new dataset implementation, so that we can easily
express those transformations. Sorry for the late reply! If you want a quick
and dirty solution, you can try hashing:

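import scala.collection.mutable
import org.apache.spark.SparkContext._  // pair-RDD functions (mapValues) on Spark 1.2 and earlier
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val rdd: RDD[(Double, Array[String])] = ...
val training = rdd.mapValues { factors =>
    // Hash each (value, column index) pair into one of 100000 buckets.
    val indices = mutable.Set.empty[Int]
    factors.view.zipWithIndex.foreach { case (f, idx) =>
      // Take the remainder before abs so Int.MinValue cannot give a negative index.
      indices += math.abs((f.## ^ idx) % 100000)
    }
    Vectors.sparse(100000, indices.toSeq.map(x => (x, 1.0)))
}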

It creates a training dataset with all binary features, with a chance
of collision. You can use it in SVM, LR, or DecisionTree.
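
To get a rough feel for that chance of collision, here is a back-of-the-envelope sketch. It assumes the hash spreads keys uniformly over the buckets, which a real hash only approximates, and the name HashCollision is hypothetical. It uses the usual birthday-bound approximation 1 - exp(-k(k-1)/(2m)) for k distinct (value, column) pairs and m buckets.

```scala
object HashCollision {
  // Approximate probability that at least two of k distinct keys
  // land in the same bucket when hashed uniformly into m buckets.
  def collisionProbability(k: Int, m: Int): Double =
    1.0 - math.exp(-k.toDouble * (k - 1) / (2.0 * m))
}
```

With 100 distinct keys and 100000 buckets, HashCollision.collisionProbability(100, 100000) comes out a little under 5%. The probability grows quickly with k, so a larger bucket count may be worth trying when there are many distinct values.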

Best,
Xiangrui

On Sun, Nov 2, 2014 at 9:20 AM, ashu <as...@iiitb.org> wrote:
> Hi,
> Sorry to bump this old thread.
> What is the state now? Is this problem solved? How does Spark handle categorical
> data now?
>
> Regards,
> Ashutosh

RE: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by ashu <as...@iiitb.org>.

Hi, 
Sorry to bump this old thread.
What is the state now? Is this problem solved? How does Spark handle categorical
data now?

Regards, 
Ashutosh




RE: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by lmk <la...@gmail.com>.

Thanks, Alexander. That gave me a clear idea of what I can look for in MLLib.

Regards,
lmk




RE: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by "Ulanov, Alexander" <al...@hp.com>.

Hi,

I cannot speak for other use cases, but MLLib doesn’t support text classification out of the box. There was basic support in MLI (thanks Sean for correcting me that it is MLI, not MLLib), but I don’t know why it is no longer being developed.

For text classification in general, there are two major input formats: folders of text files and CSV files. I can use SparkContext.textFile to load either into an RDD. However, in the case of CSV, I need to parse the loaded data, which is additional overhead. Next, I need to build a dictionary of words and convert my documents into vector space using this dictionary. I’m currently trying to implement these utilities and will probably share the code.
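
The dictionary and vector-space steps can be sketched in plain Scala as follows. This is only an illustration under simple assumptions (a naive tokenizer that splits on non-word characters, raw term counts as weights); the names BagOfWords, tokenize, buildDictionary and vectorize are hypothetical and are not the utilities referred to above.

```scala
object BagOfWords {
  // Split a document into lowercase word tokens (deliberately naive).
  def tokenize(doc: String): Seq[String] =
    doc.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Dictionary: every distinct word gets an index; words are sorted
  // first so that the numbering is reproducible.
  def buildDictionary(docs: Seq[String]): Map[String, Int] =
    docs.flatMap(tokenize).distinct.sorted.zipWithIndex.toMap

  // Sparse term-count vector as (index, count) pairs sorted by index.
  // Words missing from the dictionary are skipped.
  def vectorize(doc: String, dict: Map[String, Int]): Seq[(Int, Double)] =
    tokenize(doc)
      .flatMap(w => dict.get(w))
      .groupBy(identity)
      .map { case (i, hits) => (i, hits.size.toDouble) }
      .toSeq
      .sortBy(_._1)
}
```

For the two documents "spark is fast" and "spark spark mllib", the dictionary is fast -> 0, is -> 1, mllib -> 2, spark -> 3, and the second document becomes Seq((2, 1.0), (3, 2.0)). On a large corpus the same two passes would run over an RDD rather than a local Seq.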

Best regards, Alexander

From: Debasish Das [mailto:debasish.das83@gmail.com]
Sent: Wednesday, June 25, 2014 8:08 PM
To: user@spark.apache.org
Subject: RE: Prediction using Classification with text attributes in Apache Spark MLLib


Libsvm dataset converters are data dependent since your input data can be in any serialization format and not necessarily csv...

We have flows that convert hdfs data to libsvm/sparse vector rdd which is sent to mllib....

I am not sure if it will be easy to standardize libsvm converter on data that can be on hdfs, hbase, cassandra or solr....but of course libsvm, netflix format, csv are standard for algorithms and mllib supports all 3...
On Jun 25, 2014 6:00 AM, "Ulanov, Alexander" <al...@hp.com> wrote:
Hi lmk,

I am not aware of any classifier in MLLib that accepts nominal data. They do accept an RDD of LabeledPoints, each of which is a label plus a vector of Doubles. So, you'll need to convert nominal values to doubles.

Best regards, Alexander

-----Original Message-----
From: lmk [mailto:lakshmi.muralikrishnan@gmail.com]
Sent: Wednesday, June 25, 2014 1:27 PM
To: user@spark.incubator.apache.org
Subject: RE: Prediction using Classification with text attributes in Apache Spark MLLib

Hi Alexander,
Just one more question on a related note. Should I follow the same procedure even if my data is nominal (categorical) but has a lot of combinations? (In Weka I used to have it as nominal data)

Regards,
-lmk




RE: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by Debasish Das <de...@gmail.com>.

Libsvm dataset converters are data dependent since your input data can be
in any serialization format and not necessarily csv...

We have flows that convert hdfs data to libsvm/sparse vector rdd which is
sent to mllib....

I am not sure if it will be easy to standardize libsvm converter on data
that can be on hdfs, hbase, cassandra or solr....but of course libsvm,
netflix format, csv are standard for algorithms and mllib supports all 3...
 On Jun 25, 2014 6:00 AM, "Ulanov, Alexander" <al...@hp.com>
wrote:

> Hi lmk,
>
> I am not aware of any classifier in MLLib that accepts nominal data.
> They do accept an RDD of LabeledPoints, each of which is a label plus a
> vector of Doubles. So, you'll need to convert nominal values to doubles.
>
> Best regards, Alexander
>
> -----Original Message-----
> From: lmk [mailto:lakshmi.muralikrishnan@gmail.com]
> Sent: Wednesday, June 25, 2014 1:27 PM
> To: user@spark.incubator.apache.org
> Subject: RE: Prediction using Classification with text attributes in
> Apache Spark MLLib
>
> Hi Alexander,
> Just one more question on a related note. Should I follow the same
> procedure even if my data is nominal (categorical) but has a lot of
> combinations? (In Weka I used to have it as nominal data)
>
> Regards,
> -lmk

RE: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by "Ulanov, Alexander" <al...@hp.com>.

Hi lmk,

I am not aware of any classifier in MLLib that accepts nominal data. They do accept an RDD of LabeledPoints, each of which is a label plus a vector of Doubles. So, you'll need to convert nominal values to doubles.

Best regards, Alexander

-----Original Message-----
From: lmk [mailto:lakshmi.muralikrishnan@gmail.com] 
Sent: Wednesday, June 25, 2014 1:27 PM
To: user@spark.incubator.apache.org
Subject: RE: Prediction using Classification with text attributes in Apache Spark MLLib

Hi Alexander,
Just one more question on a related note. Should I follow the same procedure even if my data is nominal (categorical) but has a lot of combinations? (In Weka I used to have it as nominal data)

Regards,
-lmk


RE: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by lmk <la...@gmail.com>.

Hi Alexander,
Just one more question on a related note. Should I follow the same
procedure even if my data is nominal (categorical) but has a lot of
combinations? (In Weka I used to have it as nominal data)

Regards,
-lmk




RE: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by "Ulanov, Alexander" <al...@hp.com>.

Hi lmk,

There are a number of libraries and scripts that convert text to libsvm format; just type "libsvm format converter" into a search engine. Unfortunately I cannot recommend a specific one, except the one that is built into Weka. I use it for test purposes; for big experiments it is easier to write your own converter, since the format is simple enough. However, I hope that such a tool will be implemented in Spark MLLib someday, because it would benefit from parallel processing.
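
As an illustration of how little such a converter needs, here is a minimal sketch of the output side in plain Scala. It assumes features arrive as zero-based (index, value) pairs; the names LibSvmWriter and toLibSvmLine are hypothetical. A LibSVM line has the form "<label> <index>:<value> ...", with one-based indices in ascending order.

```scala
object LibSvmWriter {
  // Format one example as a LibSVM line.
  // Input indices are zero-based; the output uses one-based, ascending
  // indices and leaves out zero-valued features.
  def toLibSvmLine(label: Double, features: Seq[(Int, Double)]): String = {
    val parts = features
      .filter { case (_, value) => value != 0.0 }
      .sortBy { case (index, _) => index }
      .map { case (index, value) => s"${index + 1}:$value" }
    (label.toString +: parts).mkString(" ")
  }
}
```

For example, LibSvmWriter.toLibSvmLine(1.0, Seq((3, 2.0), (0, 1.0))) gives "1.0 1:1.0 4:2.0". Turning the raw text into those (index, value) pairs is the data-dependent part and is not covered here.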

Best regards, Alexander

-----Original Message-----
From: lmk [mailto:lakshmi.muralikrishnan@gmail.com] 
Sent: Tuesday, June 24, 2014 3:41 PM
To: user@spark.incubator.apache.org
Subject: RE: Prediction using Classification with text attributes in Apache Spark MLLib

Hi Alexander,
Thanks for your prompt response. Earlier I was executing this Prediction using Weka only. But now we are moving to a huge dataset and hence to Apache Spark MLLib. Is there any other way to convert to libSVM format? Or is there any other simpler algorithm that I can use in mllib?

Thanks,
lmk


RE: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by lmk <la...@gmail.com>.

Hi Alexander,
Thanks for your prompt response. Earlier I was executing this Prediction
using Weka only. But now we are moving to a huge dataset and hence to Apache
Spark MLLib. Is there any other way to convert to libSVM format? Or is there
any other simpler algorithm that I can use in mllib?

Thanks,
lmk




RE: Prediction using Classification with text attributes in Apache Spark MLLib

Posted by "Ulanov, Alexander" <al...@hp.com>.

Hi,

You need to convert your text to vector space model: http://en.wikipedia.org/wiki/Vector_space_model
and then pass it to SVM. As far as I know, in previous versions of MLlib there was a special class for doing this: https://github.com/amplab/MLI/blob/master/src/main/scala/feat/NGrams.scala. It is not compatible with Spark 1.0.
I wonder why MLLib folks didn't include it in newer versions of Spark.

As a workaround, you could use a separate tool to convert your data to LibSVM format (http://stats.stackexchange.com/questions/61328/libsvm-data-format) and then load it with MLUtils.loadLibSVMFile. For example, you could use Weka (http://www.cs.waikato.ac.nz/ml/weka/) to convert your file; it has a friendly UI but doesn't handle big datasets.
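
To show what a LibSVM file contains, here is a small plain-Scala sketch that parses a single line into a label and sparse features. It is not how MLUtils.loadLibSVMFile is implemented; the names LibSvmLine and parse are hypothetical, and shifting the one-based file indices to zero-based positions is a choice made for this sketch.

```scala
object LibSvmLine {
  // Parse "<label> <index>:<value> ..." into a label and
  // zero-based (index, value) pairs.
  def parse(line: String): (Double, Seq[(Int, Double)]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.toSeq.map { item =>
      val parts = item.split(":")
      (parts(0).toInt - 1, parts(1).toDouble) // file indices start at 1
    }
    (label, features)
  }
}
```

For example, LibSvmLine.parse("1 3:0.5 7:2") returns the label 1.0 with features Seq((2, 0.5), (6, 2.0)). The sketch does no validation, so malformed lines will throw.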

Best regards, Alexander

-----Original Message-----
From: lmk [mailto:lakshmi.muralikrishnan@gmail.com] 
Sent: Tuesday, June 24, 2014 3:17 PM
To: user@spark.incubator.apache.org
Subject: Prediction using Classification with text attributes in Apache Spark MLLib

Hi,
I am trying to predict an attribute with a binary value (Yes/No) using SVM.
All the attributes in my training set are text attributes.
I understand that I have to convert my outcome to a double (0.0/1.0), but I do not understand how to deal with my explanatory variables, which are also text.
Please let me know how I can do this.

Thanks.




