
Posted to user@spark.apache.org by "kian.ho" <hu...@gmail.com> on 2015/03/15 04:51:06 UTC

order preservation with RDDs

Hi, I was taking a look through the MLlib examples in the official Spark
documentation and came across the following:
http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2

Specifically, the lines:

label = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)
...
...
data1 = label.zip(scaler1.transform(features))

My question:
Wouldn't it be possible that some labels in the pairs returned by the
label.zip(..) operation are not paired with their original features? I.e.,
are the original orderings of `label` and `features` preserved after the
scaler1.transform(..) and label.zip(..) operations?

This issue was also mentioned in
http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html

I would greatly appreciate some clarification on this, as I've run into this
issue whilst experimenting with feature extraction for text classification,
where (correct me if I'm wrong) there is no built-in mechanism to keep track
of document IDs through the HashingTF and IDF fitting and transformation steps.

Thanks.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: order preservation with RDDs

Posted by "kian.ho" <hu...@gmail.com>.

For those still interested, I raised this issue on JIRA and received an
official response:

https://issues.apache.org/jira/browse/SPARK-6340




Re: order preservation with RDDs

Posted by Sean Owen <so...@cloudera.com>.

Yes, I don't think this is entirely reliable in general. I would emit
(label, features) pairs and then transform the values.

In practice, this may happen to work fine in simple cases.
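
To make the "emit pairs, then transform the values" idea concrete, here is a
minimal sketch in plain Python rather than Spark. Everything in it is a
hypothetical stand-in: `data` is an ordinary list, and `fit_scaler` and
`scale` are illustrative helpers, not MLlib's StandardScaler. It only shows
that when the label travels with its features, no later positional zip is
needed to reunite them.

```python
from math import sqrt

# Hypothetical stand-in data: (label, features) pairs as plain Python tuples.
data = [(0.0, [1.0, 10.0]), (1.0, [2.0, 20.0]), (0.0, [3.0, 30.0])]

def fit_scaler(feature_rows):
    """Return per-column means and sample standard deviations."""
    n = len(feature_rows)
    columns = list(zip(*feature_rows))
    means = [sum(col) / n for col in columns]
    stds = [
        sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return means, stds

def scale(features, means, stds):
    """Centre and scale one feature vector; zero-variance columns map to 0.0."""
    return [(x - m) / s if s else 0.0 for x, m, s in zip(features, means, stds)]

# Fit on the features only, ignoring the labels.
means, stds = fit_scaler([features for _, features in data])

# Transform only the value of each pair; the label stays attached to it.
scaled = [(label, scale(features, means, stds)) for label, features in data]
```

On an RDD of pairs, the same shape would presumably be written with
`mapValues`, which passes the key through untouched. Whether a given MLlib
model can be applied per element from Python in that way is not something
this thread establishes, so treat that part as an assumption to check.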

