Posted to reviews@spark.apache.org by mdagost <gi...@git.apache.org> on 2014/10/02 23:43:45 UTC

[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

GitHub user mdagost opened a pull request:

    https://github.com/apache/spark/pull/2636

    SPARK-3770: Make userFeatures accessible from python

    https://issues.apache.org/jira/browse/SPARK-3770
    
    We need access to the underlying latent user features from Python. However, the userFeatures RDD from the MatrixFactorizationModel isn't accessible from the Python bindings. I've added a method to the underlying Scala class that converts the RDD[(Int, Array[Double])] to an RDD[String]. This is then accessed from the Python recommendation.py.
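
As a rough illustration of the stringify-and-parse round trip described above, the sketch below models it in plain Python. The thread does not show the string format the patch used, so the comma-separated layout and the helper names `stringify_features` and `parse_features` are assumptions, and plain lists stand in for the RDDs.

```python
def stringify_features(pair):
    """Render one (id, features) pair as 'id,f1,f2,...' (hypothetical format)."""
    key, features = pair
    return ",".join([str(key)] + [repr(float(x)) for x in features])


def parse_features(line):
    """Invert stringify_features: 'id,f1,f2,...' -> (id, [f1, f2, ...])."""
    fields = line.split(",")
    return int(fields[0]), [float(x) for x in fields[1:]]


# Plain lists stand in for RDD[(Int, Array[Double])] and RDD[String].
user_features = [(1, [0.5, -1.25]), (2, [3.0, 0.125])]
as_strings = [stringify_features(p) for p in user_features]
round_tripped = [parse_features(s) for s in as_strings]
```

Every value passes through text in this approach; reviewers in this thread suggest passing the pairs across without converting them to strings.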

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mdagost/spark mf_user_features

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2636.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2636
    
----
commit e1fbe5e82a6b9436ce745175670cd005f6481173
Author: Michelangelo D'Agostino <md...@civisanalytics.com>
Date:   2014-10-02T13:33:45Z

    Added scala function to stringify userFeatures for access in python.

commit cdd98e3a43cc465844a3b38432f4edc679ffa0dd
Author: Michelangelo D'Agostino <md...@civisanalytics.com>
Date:   2014-10-02T16:05:48Z

    It's working now.

commit 34cb2a2889649e3f29f1686745320884f1fbc945
Author: Michelangelo D'Agostino <md...@civisanalytics.com>
Date:   2014-10-02T21:41:51Z

    A couple of lint cleanups and a comment.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-58243752
  
    @MLnick  @mdagost  There are a few functions available that you could use for the serialization, but PythonRDD.javaToPython might be a good option.  You can see example usage in recommendation.py.




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-58572881
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-57715181
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59956077
  
    this is ok to test




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59967277
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21994/




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mdagost <gi...@git.apache.org>.
Github user mdagost commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59769268
  
    @MLnick It doesn't look like `pairRDDToPython` does the trick.  I tried
    
    ```{python}
    def userFeatures(self):
        juf = self._java_model.userFeatures()
        juf = sc._jvm.SerDeUtil.pairRDDToPython(juf, 1)
        return juf
    ```
    
    but when I take the first element of the RDD and print it, I just get "[[B@176fa1a5" rather than any kind of nicely formatted Python object.
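
For context, `[[B` is the JVM's internal name for the type `byte[][]`, so the printed value is the default `toString` of a Java array of byte arrays rather than decoded data. One possible reading, which the thread does not confirm, is that the bytes are serialized records that still need to be deserialized on the Python side. The sketch below uses only the standard `pickle` module to show the difference between holding serialized bytes and holding the decoded object; the sample record is made up.

```python
import pickle

# A made-up (id, features) record, standing in for one element of userFeatures.
record = (1, [0.5, -1.25])

# Serialized form: opaque bytes, comparable to what the JVM side hands over.
payload = pickle.dumps(record)

# Decoded form: an ordinary Python tuple again.
decoded = pickle.loads(payload)
```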




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-57905979
  
    Can we use the existing `pairRDDToPython` function?
    
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala#L120




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mdagost <gi...@git.apache.org>.
Github user mdagost commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59935805
  
    @mengxr Unit tests are added.  I get some unrelated test failures on my local machine, but everything in `recommendation.py`, including the new tests, passes.




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-57813143
  
    @mdagost @mengxr We use Pyrolite to convert Java objects into Python objects; you can find the type mapping here: https://github.com/irmen/Pyrolite




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59967265
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21994/consoleFull) for   PR 2636 at commit [`c98f9e2`](https://github.com/apache/spark/commit/c98f9e22a87b640b9787e054067a49506aabf2b6).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-57785126
  
    @mdagost If you convert `(Int, Array[Double])` to a `java.util.List<Object>`, with the id as the first element and the features as the second (without converting to a string), you should be able to get the data correctly on the Python side. If that works, could you add `productFeatures` as well? Thanks!
    
    @davies 




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-57855517
  
    We still need this wrapper, but RDD[Array[Object]] is only used for the Python API, so it's better to put it in PythonMLLibAPI. It could be more general, something like fromTupleRDD, which would convert any RDD[Tuple[_,_]] into RDD[Array[Any]] (Any is similar to Java's Object).
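
A minimal sketch of the conversion described above, written in Python for illustration only: each pair becomes a flat array of generic objects. The name `from_tuple_rdd` mirrors the suggested `fromTupleRDD`, but the body is an assumption, and a plain list stands in for the RDD; the real helper would be Scala code in `PythonMLLibAPI`.

```python
def from_tuple_rdd(pairs):
    """Turn each (key, value) pair into a two-element list of arbitrary objects."""
    return [[key, value] for key, value in pairs]


# Works for any pair types, not just (Int, Array[Double]).
features = from_tuple_rdd([(1, [0.5, -1.25]), (2, [3.0, 0.125])])
labels = from_tuple_rdd([("a", True)])
```

Keeping the helper generic means the same function could serve both `userFeatures` and `productFeatures`.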




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59889793
  
    @mdagost Thanks for working on the SerDe! I tested it locally and it works correctly, but the unit tests for the added methods are missing. Do you mind adding them? You can follow
    
    https://github.com/mdagost/spark/blob/mf_user_features/python/pyspark/mllib/recommendation.py#L55
    
    Basically, we want to verify that userFeatures/productFeatures returns an RDD of key-value pairs with the correct number of records, and that the feature dimension is correct for each record.
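
The checks described above could look roughly like the following. This is a sketch, not the test that was added to `recommendation.py`: the helper name `check_features` and the sample data are hypothetical, and a plain list stands in for the RDD returned by `userFeatures` or `productFeatures`.

```python
def check_features(pairs, expected_count, rank):
    """Return True if there are expected_count (id, features) pairs of length rank."""
    pairs = list(pairs)
    if len(pairs) != expected_count:
        return False
    for item in pairs:
        # Each record must be a key-value pair.
        if len(item) != 2:
            return False
        _, features = item
        # Each feature vector must have one entry per latent factor.
        if len(features) != rank:
            return False
    return True


# Hypothetical output for a model with 2 users and rank 3.
sample = [(1, [0.1, 0.2, 0.3]), (2, [0.4, 0.5, 0.6])]
ok = check_features(sample, expected_count=2, rank=3)
wrong_rank = check_features(sample, expected_count=2, rank=4)
```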




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mdagost <gi...@git.apache.org>.
Github user mdagost commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59927704
  
    Whoops.  Forgot the tests :)  I'll work on those today.




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mdagost <gi...@git.apache.org>.
Github user mdagost commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59784360
  
    @davies Your idea of adding something like `fromTupleRDD` to `PythonMLLibAPI` seems to be the way to go.  I'm just doing some cleanup and will push `userFeatures` and `productFeatures` in just a bit. 




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2636




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mdagost <gi...@git.apache.org>.
Github user mdagost commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-57840219
  
    I'm totally new to Spark, so sorry if these are all dumb questions.  
    
    Are you suggesting that I convert the userFeatures `RDD[(Int, Array[Double])]` to `RDD[Array[Object]]`?  If so, do you want a helper function for doing that, like the one I wrote for the string conversion, or should I change the main userFeatures to be of that type?
    
    Also, I'm sure this is dumb, but what exact type of `Object` are we talking about?




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59957012
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21994/consoleFull) for   PR 2636 at commit [`c98f9e2`](https://github.com/apache/spark/commit/c98f9e22a87b640b9787e054067a49506aabf2b6).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mdagost <gi...@git.apache.org>.
Github user mdagost commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-58272430
  
    I've been having trouble getting either `PythonRDD.javaToPython` or `pairRDDToPython` to work.  But porting the general function I wrote from `MatrixFactorizationModel.scala` to `PythonMLLibAPI` is also giving me some trouble.  I'll get back to it later this week and try to make some progress...





[GitHub] spark pull request: SPARK-3770: Make userFeatures accessible from ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2636#issuecomment-59956125
  
    test this please

