Posted to reviews@spark.apache.org by srowen <gi...@git.apache.org> on 2014/07/13 16:58:05 UTC

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/1393

    SPARK-2465. Use long as user / item ID for ALS

    I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation.
    
    The main reason is that identifiers are not generally numeric at all, and will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to 64 bits pushes this back to billions.
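
    (For illustration, the collision estimates above follow from the standard birthday approximation; a minimal sketch with illustrative numbers, not anything from this PR:)

    ```scala
    object CollisionBound {
      // Birthday approximation: probability of at least one collision
      // among n items hashed uniformly into d buckets.
      def collisionProb(n: Double, d: Double): Double =
        1.0 - math.exp(-n * n / (2.0 * d))

      def main(args: Array[String]): Unit = {
        val d32 = math.pow(2, 32)
        val d64 = math.pow(2, 64)
        // Hundreds of thousands of IDs make a 32-bit collision near-certain...
        println(f"500k IDs, 32-bit: ${collisionProb(5e5, d32)}%.3f")
        // ...while 64 bits keeps the risk negligible at the same scale.
        println(f"500k IDs, 64-bit: ${collisionProb(5e5, d64)}%.2e")
      }
    }
    ```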
    
    It would also mean numeric IDs that happen to be larger than the largest int can be used directly as identifiers.
    
    On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating.
    
    Thoughts?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-2465

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1393.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1393
    
----
commit d4082ad7aa7468b63605d141f6d68e278983678a
Author: Sean Owen <so...@cloudera.com>
Date:   2014-07-13T14:57:15Z

    Use long instead of int for user/product IDs

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49099667
  
    QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16694/consoleFull


Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48842883
  
    QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull


Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1393#discussion_r14973237
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
    @@ -53,15 +53,15 @@ private[recommendation] case class OutLinkBlock(elementIds: Array[Int], shouldSe
      * we get product block b's message to update the corresponding users.
      */
     private[recommendation] case class InLinkBlock(
    -  elementIds: Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]])
    +  elementIds: Array[Long], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]])
     
     
     /**
      * :: Experimental ::
    - * A more compact class to represent a rating than Tuple3[Int, Int, Double].
    + * A more compact class to represent a rating than Tuple3[Long, Long, Double].
      */
     @Experimental
    -case class Rating(user: Int, product: Int, rating: Double)
    +case class Rating(user: Long, product: Long, rating: Double)
    --- End diff --
    
    Actually this is kind of tricky, we probably need a separate LongRating of some kind. This is because "user" and "product" and such are public fields and they used to be Int.


Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49106922
  
    @mateiz yeah, it was `@Experimental`, but if you want to preserve the API of this class, I imagine it would keep a `Long` field internally while still exposing `Int` accessors under the current names, which just truncate or something. It still works the same for values that fit in an `Int`, I suppose. I can try that and see how it looks.
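
    (A hedged sketch of what that shim could look like; `Rating2` and the field names are hypothetical, not the PR's code. Note this raises on overflow, which is stricter than silently truncating:)

    ```scala
    // Hypothetical shim, not the actual PR code: store IDs as Long but
    // keep Int-typed accessors under the old names for values that fit.
    class Rating2(val userId: Long, val productId: Long, val rating: Double) {
      def user: Int = {
        require(userId.isValidInt, s"user ID $userId does not fit in an Int")
        userId.toInt
      }
      def product: Int = {
        require(productId.isValidInt, s"product ID $productId does not fit in an Int")
        productId.toInt
      }
    }
    ```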
    
    I'd not be offended if this is better considered for 2.x, after more evidence exists whether this is really causing problems or not.


Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48845420
  
    QA results for PR 1393:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      case class Rating(user: Long, product: Long, rating: Double)

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull


Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49013972
  
    Yes you could also tell callers to track their own user-ID mapping and maintain it consistently everywhere. Callers have to share that state then somehow. Hashing is easier, and 64 bits makes it work for practical purposes. 
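
    (One way to picture the hashing described here; 64-bit FNV-1a is just an illustrative choice of hash, not anything MLlib ships:)

    ```scala
    object IdHash {
      // Map an arbitrary string identifier onto the 64-bit Long key space
      // using FNV-1a. Deterministic, so the same ID always gets the same key.
      def hash64(id: String): Long = {
        var h = 0xcbf29ce484222325L // FNV offset basis
        for (b <- id.getBytes("UTF-8")) {
          h ^= (b & 0xffL)
          h *= 0x100000001b3L // FNV prime
        }
        h
      }
    }
    ```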
    
    A caller has to do something like this to deal with real-world identifiers, because an `Int` ID API by itself doesn't quite work. This is an instance of a meta-concern I have: an API that (from my perspective) is going to be problematic at scale becomes unchangeable before battle-testing. (I actually thought all of MLlib was de facto `@Experimental`?)
    
    Yes, you can layer other APIs on top to fix it, or use `@deprecated` in cases like this to keep existing methods while adding new signatures. I think that would be the simplest solution to this particular concern.
    
    The question of serialized size is still out there. That is worth weighing in on.


Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48850704
  
    QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull


Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49069526
  
    No, MLlib is not experimental; only the parts annotated with @Experimental are. The reason is that we felt we could continue supporting these low-level APIs indefinitely and add other ones later if we need to. Again, for real users, API stability matters *much* more than you'd think -- there's nothing more annoying than having to change your app just to take a software upgrade, and it causes fragmentation of the userbase as users stick to an older version instead of upgrading.
    
    In this particular case, there are a few things we can do. We can think of additions to the API here that preserve the old methods but add new versions of predict. We can add a new class called LongALS or something like that, have the existing classes call it, and get back a LongMatrixFactorizationModel. Or we can offer a utility to generate unique IDs.
    
    The reason I was asking about hash collisions above is that even with 64-bit IDs, you're not guaranteed to be collision-free. With 2-3 billion users you actually have a good chance of a collision. So if applications care about that, they may not be okay with this solution either.
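
    (This estimate checks out under the birthday approximation; the arithmetic below is illustrative only, not a measurement:)

    ```scala
    object BirthdayCheck {
      // P(at least one collision) for n uniform hashes into d buckets.
      def pCollision(n: Double, d: Double): Double =
        1.0 - math.exp(-n * n / (2.0 * d))

      def main(args: Array[String]): Unit = {
        val n = 3e9               // ~3 billion users
        val d = math.pow(2, 64)   // 64-bit ID space
        // Chance that *some* pair collides is substantial (~0.22)...
        println(f"P(any collision)         = ${pCollision(n, d)}%.2f")
        // ...even though the expected number of colliding pairs stays tiny
        // relative to n (~0.24 pairs out of 3 billion IDs).
        println(f"Expected colliding pairs = ${n * n / (2.0 * d)}%.2f")
      }
    }
    ```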


Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49203419
  
    @mateiz I think it would mean mostly cloning `ALS.scala`, as the `Rating` object is woven throughout. Probably some large chunks could be refactored and shared. Is that what you mean? Even I'm not sure two APIs are worth the trouble. When I get to that point and see what it takes to make a 64-bit key implementation, yes, I can propose what it looks like.


Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49149396
  
    Let me close this PR for now. I will fork or wrap as necessary. Keep it in mind, and maybe in a 2.x release this can be revisited. (Matei I ran into more problems with the `Rating` class retrofit anyway.)
    
    Yes storage is the downside. Your comments on JIRA about effects of serialization in compressing away the difference are promising. I completely agree with using `Float` for ratings and even feature vectors.
    
    Yes I understand why random projections are helpful. It doesn't help accuracy, but may only trivially hurt it in return for some performance gain. If I have just 1 rating, it doesn't make my recs better to arbitrarily add your ratings to mine. Sure that's denser, and maybe you're getting less overfitting, but it's fitting the wrong input for both of us.
    
    A collision here and there is probably acceptable. One in a million customers? OK. 1%? Maybe a problem. I agree, you'd have to quantify this to decide. If I'm an end user of MLlib bringing even millions of things to my model, I have to decide, and if it's a problem, maintain a lookup table to use it.
    
    It seemed simplest to moot the problem with a much bigger key space and engineer around the storage issue. A bit more memory is cheap; accuracy and engineer time are expensive. 


Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1393#discussion_r14973020
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
    @@ -53,15 +53,15 @@ private[recommendation] case class OutLinkBlock(elementIds: Array[Int], shouldSe
      * we get product block b's message to update the corresponding users.
      */
     private[recommendation] case class InLinkBlock(
    -  elementIds: Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]])
    +  elementIds: Array[Long], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]])
     
     
     /**
      * :: Experimental ::
    - * A more compact class to represent a rating than Tuple3[Int, Int, Double].
    + * A more compact class to represent a rating than Tuple3[Long, Long, Double].
      */
     @Experimental
    -case class Rating(user: Int, product: Int, rating: Double)
    +case class Rating(user: Long, product: Long, rating: Double)
    --- End diff --
    
    This is also a breaking change; we need to add another constructor that takes Ints. You may be able to add one like this:
    ```
    case class Rating(user: Long, product: Long, rating: Double) {
      def this(u: Int, p: Int, r: Double) = this(u.toLong, p.toLong, r)
    }
    ```
    Or if it doesn't work, you need to turn this into a class instead of a case class and add extends Serializable and the various constructors yourself.
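
    (One caveat with the auxiliary-constructor approach, as a hedged aside rather than PR code: call sites compiled against the old API that used `Rating(u, p, r)` without `new` link against the companion's `apply(Int, Int, Double)`, so a matching overload there may also be needed:)

    ```scala
    case class Rating(user: Long, product: Long, rating: Double)

    object Rating {
      // Overload matching the old Int-based apply signature, so code
      // compiled against the previous Rating(Int, Int, Double) still links.
      // The compiler still synthesizes apply(Long, Long, Double) alongside it.
      def apply(user: Int, product: Int, rating: Double): Rating =
        new Rating(user.toLong, product.toLong, rating)
    }
    ```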


Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48848503
  
    QA results for PR 1393:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      case class Rating(user: Long, product: Long, rating: Double)

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull


Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49210831
  
    Yeah, that's what I meant, we can clone it at first but we might be able to share code later (at least the math code we run on each block, or stuff like that). But let's do it only if you find out it's worth it.


Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48999523
  
    I am trying to figure out why it happened. This might not be my final conclusion, but at the moment I feel that, since this class has a private[mllib] constructor, there is an entry in the ignores file as follows: `org.apache.spark.mllib.recommendation.MatrixFactorizationModel.<init>`. This particular entry makes the whole class ignored by the MiMa tool.


Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49107533
  
    QA results for PR 1393:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      case class Rating(user: Long, product: Long, rating: Double)

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16694/consoleFull


Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48846073
  
    QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull


Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49104938
  
    Wise words, Matei! Anyway, here's another cut that preserves the original API. Tests are still running. Up to you guys' judgment on whether it's worthwhile.


Posted by srowen <gi...@git.apache.org>.
Github user srowen closed the pull request at:

    https://github.com/apache/spark/pull/1393


Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1393#discussion_r14972737
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
    @@ -53,7 +53,7 @@ class MatrixFactorizationModel private[mllib] (
         * @param usersProducts  RDD of (user, product) pairs.
         * @return RDD of Ratings.
         */
    -  def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] = {
    +  def predict(usersProducts: RDD[(Long, Long)]): RDD[Rating] = {
    --- End diff --
    
    Interesting, I think Rating shouldn't have been marked experimental. We should definitely fix that.


Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48987125
  
    I'm somewhat worried about this because it seems difficult to do without breaking API changes. Do users really want to rely on luck to avoid collisions? It might be better to add some efficient way to assign unique IDs.


Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1393#discussion_r14926032
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
    @@ -53,7 +53,7 @@ class MatrixFactorizationModel private[mllib] (
         * @param usersProducts  RDD of (user, product) pairs.
         * @return RDD of Ratings.
         */
    -  def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] = {
    +  def predict(usersProducts: RDD[(Long, Long)]): RDD[Rating] = {
    --- End diff --
    
    I had understood all of this to be an `@Experimental` API, though it is not consistently marked. For example, `Rating` is experimental but its API is actually bound to this API here.


Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49072664
  
    Yeah API stability is very important. I keep banging on about the flip-side -- freezing an API that may still need to change. You get a different important problem. I'm sure everyone gets that, and it's a judgment call and trade-off.
    
    I will change the PR to preserve the existing methods and add new ones. That's the thing we can consider and merge or not. I'm not offended if nobody else is feeling this one. I can always fork/wrap this aspect to fit what I need it to do. (And I have other API suggestions I'd rather spend time on, if anything.)
    
    I wouldn't want to add the overhead of a separate set of implementations just for 64 bit values. Users would have a hard time understanding the difference and choosing.
    
    3 billion people is a lot! It could happen, yes. Maybe not with people but with, say, URLs. *If* collisions mattered much, then with many billions of things, you can't use the ALS implementation as it stands, since _most_ IDs would collide no matter how you map or hash. That's the best motivation I can offer for this change.


Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1393#discussion_r14916107
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
    @@ -53,7 +53,7 @@ class MatrixFactorizationModel private[mllib] (
         * @param usersProducts  RDD of (user, product) pairs.
         * @return RDD of Ratings.
         */
    -  def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] = {
    +  def predict(usersProducts: RDD[(Long, Long)]): RDD[Rating] = {
    --- End diff --
    
    This is a breaking API change unfortunately, and because of type erasure, users who compiled against an old MLlib (passing RDD[(Int, Int)]) will have their code link against the new one and get a ClassCastException at runtime.


Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48999631
  
    And to my surprise, it also has `org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict`; not sure why it has that.


Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49131064
  
    Besides breaking the API, I'm also worried about two things:
    
    1. The increase in storage. We had some discussion before v1.0 about whether we should switch to long or not. ALS is not computation heavy for small k but communication heavy. I posted some screenshots on the JIRA page, where ALS shuffles ~200GB data in each iteration. With Long ids, this number may become ~300GB and hence ALS may slow down by 50%. Instead of upgrading the id type to Long, I'm actually thinking about downgrading the rating type to Float.
    
    2. Is collision really bad? ALS needs a somewhat "dense" matrix to compute good recommendations. If there are 3 billion users but each user only gives 1 or 2 ratings, ALS is very likely to overfit. In this case, making a random projection on the user side would certainly help, and hashing is one of the commonly used techniques for random projection. There will be bad recommendations whether or not there are hash collisions. So I'm really interested in some measurements on the downside of hash collision.
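
    (The ~200GB-to-~300GB estimate in point 1 tracks the raw field widths; a toy back-of-envelope calculation, illustrative only:)

    ```scala
    object RatingSize {
      // Bytes per (user, product, rating) triple shuffled as raw
      // primitives (ignores object headers, references, and compression).
      def bytes(idBytes: Int, ratingBytes: Int): Int = 2 * idBytes + ratingBytes

      def main(args: Array[String]): Unit = {
        println(s"Int ids + Double rating:  ${bytes(4, 8)} bytes") // 16
        println(s"Long ids + Double rating: ${bytes(8, 8)} bytes") // 24, i.e. +50%
        println(s"Long ids + Float rating:  ${bytes(8, 4)} bytes") // 20
      }
    }
    ```

    The jump from 16 to 24 bytes per triple is the same +50% as the shuffle-volume estimate, and Float ratings would claw back half of it.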


Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49201924
  
    Sean, I'd still be okay with adding a LongALS class if you see benefit for it in some use cases. Let's just see how it works in comparison.


Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1393#discussion_r14916135
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
    @@ -36,10 +36,10 @@ import org.apache.spark.mllib.api.python.PythonMLLibAPI
      */
     class MatrixFactorizationModel private[mllib] (
         val rank: Int,
    -    val userFeatures: RDD[(Int, Array[Double])],
    -    val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
    +    val userFeatures: RDD[(Long, Array[Double])],
    +    val productFeatures: RDD[(Long, Array[Double])]) extends Serializable {
       /** Predict the rating of one user for one product. */
    -  def predict(user: Int, product: Int): Double = {
    +  def predict(user: Long, product: Long): Double = {
    --- End diff --
    
    This is also a breaking API change, weird that Jenkins didn't catch it. @scrapcodes and @pwendell is this something we're not checking for right now in MIMA? Because this class is not experimental or anything.


Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48843270
  
    I think the most significant change is the Rating object. It goes from 8 (ref) + 8 (object header) + 4 (int) + 4 (int) + 8 (double) = 32 bytes to 8 (ref) + 8 (object header) + 8 (long) + 8 (long) + 8 (double) = 40 bytes.
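A rough sketch of that accounting (hypothetical helper; real JVM layouts vary with compressed oops and alignment padding):

```scala
// Rough heap accounting for a Rating on a 64-bit JVM: a reference to the
// object, an approximate object header, two ID fields, and one Double.
// These are back-of-the-envelope estimates, not measured values.
def ratingBytes(idBytes: Int): Int =
  8 /* reference */ +
  8 /* object header (approx.) */ +
  2 * idBytes /* user + product IDs */ +
  8 /* rating: Double */

val withIntIds  = ratingBytes(4) // 32 bytes
val withLongIds = ratingBytes(8) // 40 bytes
```

A tool like JOL (Java Object Layout) would give exact per-JVM numbers.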


---

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49000211
  
    Ah, understood (please ignore my previous theory). It happened because there is a function annotated `@DeveloperApi` in the same class with the same name, so it was added to the ignores file by our GenerateMimaIgnoreTool, and as a result the MiMa check for all `predict` methods was disabled. Not sure if MiMa can disambiguate them; I will check that.


---

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49104793
  
    I didn't suggest having a new implementation for long IDs, only a new API. They can run on the same implementation (e.g. the current Int-based one transforms the Ints to Longs and calls that one). This is a much more sensible way to evolve the API and it's very common in other software. All our MLlib APIs were designed to support this kind of evolution (e.g. you set your parameters using a builder pattern, where we can add new methods, and the top-level API is just functions you can call that we can easily map to more complex versions of the functions).
    
    The place I'm coming from is that there are *far* more complex APIs than ours that have retained backwards compatibility over decades, and were maintained by a similar-sized team. One great example is Java's class library, which is not only a great library but has also been compatible since 1.0. There are well-known ways to retain compatibility while still improving the API, such as adding a new package (e.g. java.nio vs java.io). I would be totally fine doing that with MLlib as we gain experience with it, but there's no reason to break the old API in the process. Again, I feel that people from today's tech company world think way too much about "perfecting" an API by repeatedly tweaking it, and while that works within a single engineering team, it doesn't work in software that you expect someone else to use.
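    A minimal sketch of the delegation pattern described above (hypothetical class and stubbed computation, not code from this PR): the old Int-based signature stays and simply widens its arguments before calling the Long-based implementation.

```scala
// Hypothetical sketch: keep the old Int-based API as a thin shim over a
// Long-based implementation, so existing callers keep compiling.
class ExampleModel {
  // New, Long-based entry point
  def predict(user: Long, product: Long): Double = {
    // Real computation would go here; stubbed for illustration.
    (user + product).toDouble
  }

  // Old Int-based signature, preserved for source compatibility;
  // it just widens the Ints and delegates.
  def predict(user: Int, product: Int): Double =
    predict(user.toLong, product.toLong)
}
```

Callers passing Ints resolve to the old overload and get identical results to callers passing Longs.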


---

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48852700
  
    QA results for PR 1393:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      case class Rating(user: Long, product: Long, rating: Double)
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull


---

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49210917
  
    BTW this could also be a place to use the dreaded Scala @specialized annotation to template the code for Ints vs Longs, though as far as I know that's being deprecated by the Scala developers.
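    For illustration, a hedged sketch of what that could look like (hypothetical class, not from the PR): `@specialized` asks the compiler to generate unboxed variants of the class for the listed primitive types.

```scala
// Hypothetical sketch: one Rating definition specialized for Int and Long
// IDs, so the compiler emits unboxed variants for both primitive types.
class SpecializedRating[@specialized(Int, Long) ID](
    val user: ID,
    val product: ID,
    val rating: Double)

val intRating  = new SpecializedRating(1, 2, 5.0)   // ID inferred as Int
val longRating = new SpecializedRating(1L, 2L, 5.0) // ID inferred as Long
```

The usual caveats apply: specialization multiplies generated bytecode and interacts poorly with some features, which is part of why it has a mixed reputation.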


---

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49110420
  
    Yeah, this is tricky. Maybe we need to create a LongRating class, even though it's ugly. @mengxr what do you think? I do think we'd like to have support for Longs here.


---

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-48843123
  
    How much does this increase memory usage overall? Is there a detailed comparison?


---