Posted to user@spark.apache.org by Yao <yg...@ford.com> on 2014/12/29 04:37:01 UTC

Re: Using TF-IDF from MLlib

I found the TF-IDF feature extraction, and all the MLlib code that works with
plain Vector RDDs, very difficult to use because there is no way to associate
a vector back to the original data. Why can't Spark MLlib support
LabeledPoint?



Re: Using TF-IDF from MLlib

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
FWIW the JIRA I was thinking about is
https://issues.apache.org/jira/browse/SPARK-3098


Re: Using TF-IDF from MLlib

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
I vaguely remember that JIRA; AFAIK Matei's point was that the order is
not guaranteed *after* a shuffle. If you only use operations like map, which
preserve partitioning, ordering should be preserved as far as I know.
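A minimal sketch of that distinction, assuming an existing SparkContext sc
(the data is purely illustrative):

import org.apache.spark.rdd.RDD

val base: RDD[Int] = sc.parallelize(1 to 10, numSlices = 4)

// map keeps the partitioning and per-partition order, so zip lines up element-wise.
val doubled: RDD[Int] = base.map(_ * 2)
base.zip(doubled).collect()   // (1,2), (2,4), ..., (10,20)

// A shuffle (repartition, groupByKey, join, ...) redistributes elements, so there is
// no ordering guarantee afterwards; zipping shuffled against base may pair unrelated
// elements, or fail if per-partition counts no longer line up.
val shuffled: RDD[Int] = base.repartition(4)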


Re: Using TF-IDF from MLlib

Posted by Sean Owen <so...@cloudera.com>.
Dang, I can't seem to find the JIRA now, but I'm sure we had a discussion
with Matei about this, and the conclusion was that RDD order is not
guaranteed unless a sort is involved.

Re: Using TF-IDF from MLlib

Posted by Joseph Bradley <jo...@databricks.com>.
This was brought up again in
https://issues.apache.org/jira/browse/SPARK-6340, so I'll answer one item
that was asked about: the reliability of zipping RDDs. Basically, it
should be reliable, and if it is not, it should be reported as a bug.
This general approach should work (with explicit types to make it clear):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Start from (label, terms) pairs: HashingTF hashes term sequences, not Vectors.
val data: RDD[(Double, Seq[String])] = ...
val labels: RDD[Double] = data.map(_._1)
val terms: RDD[Seq[String]] = data.map(_._2)
val features2: RDD[Vector] = new HashingTF(numFeatures = 100).transform(terms)
// idfModel is an IDFModel, e.g. obtained via new IDF().fit(features2).
val features3: RDD[Vector] = idfModel.transform(features2)
// zip is safe here: features3 comes from data via map-only steps, so partitioning and per-partition counts match.
val finalData: RDD[LabeledPoint] =
  labels.zip(features3).map { case (label, features) => LabeledPoint(label, features) }

If you run into problems with zipping like this, please report them!
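The re-assembled finalData can then go straight into any MLlib learner that
takes LabeledPoints; a minimal follow-up sketch (NaiveBayes is an arbitrary
illustrative choice):

import org.apache.spark.mllib.classification.NaiveBayes

// Train on the labeled TF-IDF vectors (TF-IDF weights are nonnegative, as NaiveBayes requires).
val model = NaiveBayes.train(finalData, lambda = 1.0)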

Thanks,
Joseph


Re: Using TF-IDF from MLlib

Posted by Xiangrui Meng <me...@gmail.com>.
Hopefully the new pipeline API addresses this problem. We have a code
example here:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala

-Xiangrui
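
For reference, a condensed sketch along the lines of that example, assuming
Spark's ML Pipeline API with DataFrames, an existing SparkContext sc, and a
SQLContext with import sqlContext.implicits._ in scope for toDF(); the toy
data and column names are illustrative:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

case class LabeledDocument(id: Long, text: String, label: Double)

val training = sc.parallelize(Seq(
  LabeledDocument(0L, "spark mllib tf idf", 1.0),
  LabeledDocument(1L, "hadoop mapreduce", 0.0)))

// Tokenize -> hash to term frequencies -> fit a classifier, all in one Pipeline, so the
// label column rides along and never has to be re-associated with the feature vectors.
// (A spark.ml IDF stage can be slotted in after hashingTF in later Spark versions.)
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(training.toDF())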



Re: Using TF-IDF from MLlib

Posted by andy petrella <an...@gmail.com>.
Here is what I did for this case: https://github.com/andypetrella/tf-idf


Re: Using TF-IDF from MLlib

Posted by Sean Owen <so...@cloudera.com>.
Given (label, terms) pairs, you can just transform the values to a TF vector,
then a TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
make a LabeledPoint from each (label, vector) pair. Is that what you're
looking for?
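
A minimal sketch of that approach, assuming an existing SparkContext sc and
an input RDD of (label, terms) pairs (the toy data and names are illustrative):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val labeledTerms: RDD[(Double, Seq[String])] = sc.parallelize(Seq(
  (1.0, Seq("spark", "mllib", "tf", "idf")),
  (0.0, Seq("hadoop", "mapreduce"))))

// Hash each term sequence into a term-frequency vector; mapValues keeps the label attached.
// (On Spark < 1.3, also import org.apache.spark.SparkContext._ for mapValues.)
val hashingTF = new HashingTF()
val tf: RDD[(Double, Vector)] = labeledTerms.mapValues(terms => hashingTF.transform(terms))

// Fit IDF on the vectors alone, then reweight each value with the per-vector transform.
val idfModel = new IDF().fit(tf.values)
val tfidf: RDD[(Double, Vector)] = tf.mapValues(v => idfModel.transform(v))

// Reattach the labels as LabeledPoints, with no zipping needed.
val labeled: RDD[LabeledPoint] = tfidf.map { case (label, v) => LabeledPoint(label, v) }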

