You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by datumbox <gi...@git.apache.org> on 2017/02/24 23:30:34 UTC

[GitHub] spark pull request #17059: Removed unnecessary castings and refactored check...

GitHub user datumbox opened a pull request:

    https://github.com/apache/spark/pull/17059

    Removed unnecessary castings and refactored checked casts in ALS.

    ## What changes were proposed in this pull request?
    
    The original ALS was performing unnecessary casting to the user and item ids because the protected checkedCast() method required a double. I removed the castings and refactored the method to receive Any and efficiently handle all permitted numeric values.
    
    ## How was this patch tested?
    
    I tested it by running the unit-tests and by manually validating the result of checkedCast for various legal and illegal values.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/datumbox/spark als_casting_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17059.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17059
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    This one would be good for @mlnick to look at.
    
    The Scala Long will match java.lang.Number right? if so it works even if there's another conversion there, but maybe that one is trivial, and no more common a case to optimize for than others.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @MLnick @srowen @imatiach-msft Thank you for your comments and for helping making this improvement robust. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73718/testReport)** for PR 17059 at commit [`3050f6e`](https://github.com/apache/spark/commit/3050f6eeda769127196e8d1ad4b432b92af0ea7c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @MLnick Cool did not know that. Just updated commit messages. I think I'm good with this PR from my side.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @imatiach-msft This is already used in other parts of MLlib. Have a look here: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala#L311


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #3582 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3582/testReport)** for PR 17059 at commit [`98d6481`](https://github.com/apache/spark/commit/98d6481cc565b8807b6b953771cd8ac41d0c923f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox I've added some additional comments with regards to the fractional part, please take a look, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17059


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73585/testReport)** for PR 17059 at commit [`fc3acfb`](https://github.com/apache/spark/commit/fc3acfb1cd8e02f714425b79e5ef7cca4a529e26).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103547192
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    sorry, with regards to the fractional part, computations like:
    0.764553836902861+0.0701367068431045+0.165309456254034 == 1
    fail -- I strongly encourage to use a threshold for floating point numbers from IEEE-754 standard, since these values could be coming from anywhere - eg, they might have been in a sql database or some other location and when we load we can get strange errors.  For example, the following test case used to work in the previous code:
    
      test("verify can run on double values") {
        val spark = this.spark
        import spark.implicits._
    
        val als = new ALS().setMaxIter(1).setRank(1)
        val df = Seq(
          (0D, 0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 3.0),
          (0D, 0D, 2.0),
          (0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 0D, 5.0)
        ).toDF("user", "item", "rating")
        df.show()
        val model = als.fit(df)
      }
    
    with the new code it fails with error:
    
    Caused by: java.lang.IllegalArgumentException: ALS only supports values in Integer range for columns user and item. Value 0.9999999999999996 was out of Integer range.
    
    This value was very close to 1 however, and mathematically the values sum to 1: 0.764553836902861 + 0.0701367068431045 + 0.165309456254034, but due to IEEE floating point representation we can run into such issues.
    If we do insist on breaking existing users by removing the rounding, we should at least allow them an escape hatch to specify the precision.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    You are effectively handling the cases that casting handles. Doesn't a Scala long get boxed another time now? I bet it works, just wondering why that's not handled like Int. I'm neutral on this and wouldn't merge it myself but others are welcome to.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Could you explain what you mean by "duplicating"? It is safe with Scala Long; I did lots of tests to ensure that it works well. If you require any changes, I'm happy to update the PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103676000
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -205,6 +206,69 @@ class ALSSuite
         assert(decompressed.toSet === expected)
       }
     
    +  test("CheckedCast") {
    +    val checkedCast = new ALS().checkedCast
    +    val df = spark.sqlContext.range(1)
    --- End diff --
    
    This can just be `spark.range(1)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    I get it, but is that cast actually slowing things down measurably? because this change also adds some overhead in the checks it does. I think it's better to let the cast to double deal with the source nature of the number consistently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @imatiach-msft thanks for reviewing and requesting the test case, I was going to ask for one :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73697/testReport)** for PR 17059 at commit [`0b49a50`](https://github.com/apache/spark/commit/0b49a50fd071b65b0887f029a04b0a6f33a684a2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    That's compelling regarding performance. It's not big but not trivial.
    My remaining concern is whether you're handling all the cases the original did. `Number` covers a lot but does it include all the SQL decimal types? I think the common casting mechanism would cover those. Granted, it's odd if someone tries to use those values, but ideally would not work differently after this change.
    
    I am still not sure about the `Long` case -- don't you want to handle Scala Long like Int here for efficiency?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox I added an additional comment -- since you believe we should keep the current code, we should add a test case and slightly modify the error message to let the user know what other error condition they are hitting, please take a look, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen I believe that this needs to be fixed for 2 reasons:
    1. Casting the ids to double just to convert it back to integer is not an elegant solution and it is rather confusing.
    2. The double casting puts more strain on the garbage collector and I've personally measured it in an earlier version with and without the hack.
    
    Finally I do not believe the proposed fix slows down things as it does a similar number of comparisons as the original code. If you still believe this is not worth it feel free to close the PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73620/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103378267
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    Sorry, I'm not sure if this is a good idea due to floating point precision... the code above doesn't seem to do this check, it just calls toInt -- however, if this is absolutely necessary, I would hope that we could give the user some way to specify the Int range or precision.  Also, if we are going to go ahead with this change, then we should add some tests to verify the case that the exception is thrown, but without some ability to specify the precision I'm not sure if this is the correct thing to do (?).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73706/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73718/testReport)** for PR 17059 at commit [`3050f6e`](https://github.com/apache/spark/commit/3050f6eeda769127196e8d1ad4b432b92af0ea7c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @imatiach-msft Thanks mate. Show it and replied. I really don't mind it either way to be honest.
    
    @MLnick @srowen I am not sure if this PR will make it to Spark 2.2 due to the upcoming code freeze... So, there is one more tiny little thing that was always bothering me in the original implementation which is unrelated to the casting but fixing it and sending a separate PR is just embarassing... The default values of regParam do not match within the same file (lines [224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224) and [714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714). Do you mind it if I change the default of train() to 0.1 and sneak it in with this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Concerning the failed scala style test, this is caused by the long in-line comment that I added to explain the if statement. If you decide to approve the PR, I can remove it. Cheers! :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/17059

The discussion around numeric types was at https://github.com/apache/spark/pull/12762#r61539327. The checked cast was introduced because it was previously implicitly cast to `Int` anyway, and would mangle `Long` ids without any warning. I did think about only supporting `Int` & `Long` there - but we decided to just support any numeric type within range. I guess it was around the time that we were porting most pipeline components to support all numeric input types.

In hindsight I would have just gone for restricting it to `Int` & `Long`.

In reality in the case of ALS users really should not be using `Double` for ids (e.g. an id of `123.4` is clearly wrong, `123.0` could occur I guess more by accident or poor schema, but in that case we support it here). So I think the likelihood of any real impact from being restrictive in terms of fractionals is really low.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    I get it, though the drawback is mostly that you've duplicated in a way the machinery that would construe any number as something comparable to the min/max int value. (Does this not miss the case of a Scala Long?) I like that it catches the case of Double values that aren't integral, but, that's quite a corner case anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged to master. Thanks @datumbox, and everyone for reviews.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73693/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: Removed unnecessary castings and refactored checked cast...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73620/testReport)** for PR 17059 at commit [`eb33189`](https://github.com/apache/spark/commit/eb331897e4a70c2d850bdfe6ef7bf9c1acb50653).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103392500
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @imatiach-msft: In this snippet we deal with Uset id and Item id. Those things should no have fractional bits. What I do here is convert the number into Integer and compare its double value. If the values are identical one of the following is true:
    - The value is Byte or Short.
    - The value is Long but within the Integer range.
    - The value is Double or Float but without any fractional part.
    
    I think this snippet is fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103547149
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    sorry, with regards to the fractional part, computations like:
    0.764553836902861+0.0701367068431045+0.165309456254034 == 1
    fail -- I strongly encourage to use a threshold for floating point numbers from IEEE-754 standard, since these values could be coming from anywhere - eg, they might have been in a sql database or some other location and when we load we can get strange errors.  For example, the following test case used to work in the previous code:
    
      test("verify can run on double values") {
        val spark = this.spark
        import spark.implicits._
    
        val als = new ALS().setMaxIter(1).setRank(1)
        val df = Seq(
          (0D, 0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 3.0),
          (0D, 0D, 2.0),
          (0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 0D, 5.0)
        ).toDF("user", "item", "rating")
        df.show()
        val model = als.fit(df)
      }
    
    with the new code it fails with error:
    
    Caused by: java.lang.IllegalArgumentException: ALS only supports values in Integer range for columns user and item. Value 0.9999999999999996 was out of Integer range.
    
    This value was very close to 1 however, and mathematically the values sum to 1: 0.764553836902861 + 0.0701367068431045 + 0.165309456254034, but due to IEEE floating point representation we can run into such issues.
    If we do insist on breaking existing users by removing the rounding, we should at least allow them an escape hatch to specify the precision.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103559188
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @datumbox Sorry the link doesn't seem to work for me.  If you could add it to the ALSSuite that would be great.  Oh no, I don't mean to have a different check, just to have a clearer exception message - something like "throw new IllegalArgumentException(s"ALS only supports values in Integer range and without fractional part for columns ${$(userCol)} and ${$(itemCol)}. Value $n was either out of Integer range or contained a fractional part that could not be converted.") "


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen Yes, the behaviour of the method remains the same. This patch helped me get a measurable improvement on GC overhead in Spark 2.0, so I though that it would be beneficial for others. Anyway thanks for the comments. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen @mlnick I updated the PR based on what was discussed above and I tested it again on Spark 2.1. I also updated the coding styles and the exception message. 
    
    The changes requested by Sean, should not affect the speed of the proposed solution but rather help us catch all the corner cases. I tested this by generating a fresh batch of 1000 samples using the latest implementation ([details here](http://pastebin.com/vbbX1c7p)). The mean of the sample is 4.53837 which is no different than the mean of the code that I committed yesterday (p-value 0.4265 - H0 of equal means can't be rejected). So we are good.
    
    Could any of the admins help re-test this because the "Jenkins, retest this please" did not work for me earlier?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen The following snippet handles explicitly Longs. It can be rewritten to remove duplicate code by introducing bools for overflow detection but I don't think it is worth it. In theory you can catch also explicitly other types such as Byte and Short but I think that's an overkill.
    
    As far as I saw, all SQL numerical types inherit from Number so comparing their doubleValue with their intValue would be enough to check if they are within integer range. 
    
    ```scala
        val u = udf { (n: Any) =>
          n match {
            case v: Int => v
            case v: Long =>
              val intV = v.intValue
              if (v == intV) {
                intV
              }
              else {
                throw new IllegalArgumentException("out of range")
              }
            //case v: Byte => v.toInt 
            //case v: Short => v.toInt
            case v: Number =>
              val intV = v.intValue
              if (v.doubleValue == intV) {
                intV
              }
              else {
                throw new IllegalArgumentException("out of range")
              }
            case _ => throw new IllegalArgumentException("invalid type")
          }
        }
    ```
    
    Personally, I would remove the explicit Long case as it introduces duplicate code and does not help match. The remaining snippet avoids doing any casting if the ID is integer (which should be the majority of cases and yields the biggest memory/speed gains) or non-numeric and handles all corner cases (All scala/java numeric types + SQL Numerics). Agree?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox left a few minor comments to clean up. Overall looks good. I think it is fine for 2.2.
    
    For the regParam even though it is a minor thing it would be best to just create a new JIRA & PR for it separately.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73706/testReport)** for PR 17059 at commit [`361d9cb`](https://github.com/apache/spark/commit/361d9cb30fe87a2bd21a2476a7b549f14878df3c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103557982
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @imatiach-msft I already tested it, check this snippet posted [here](https://github.com/apache/spark/pull/17059#discussion_r103554415). I can add it in ALSSuite if necessary. The exception message is technically correct since a number with a fractional part is not in Integer range (plus you get to see the actual value of the number). Returning a different exception would require having two separate if checks instead of one. Do we really want that?
    
    Guys to be honest this PR solves a very simple thing and I did not anticipate to be so controvercial. May I suggest we agree on the final set of changes and merge it? Perhaps we can tackle any other concerns on new pull-requests? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73585/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103550608
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @imatiach-msft I am aware of the implications of floating point precision and I understand your concerns. 
    
    Having said that though, even allowing user and item Ids to be double/float is not a good idea. We just keep it for backwards compatibility I guess. Also note that the current implementation of Spark 2.1 will actually take that your 0.9999999999999996 value and silently cast it to Int (so it becomes 0)! For me the only permitted types should have been Integer, Long and BigIntegers.
    
    I don't have strong opinions about refactoring anything in the Number case as it simply performance-wise it does not matter. The point of this PR is to optimize the general case where the id is Int because the casting of the current approach algenerates twice as much data as the original dataset (of course it is GCed by at a cost). 
    
    @MLnick @srowen It's your call.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @MLnick Yeap! I have run into problems with the original implementation when dealing with several billions records. This is when removing casting starts paying off. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73697/testReport)** for PR 17059 at commit [`0b49a50`](https://github.com/apache/spark/commit/0b49a50fd071b65b0887f029a04b0a6f33a684a2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @MLnick I accepted all proposed changes. Please retest. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Ok, let me take a look at this. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox you mention there is GC & performance overhead which makes some sense. Have you run into problems with very large scale (like millions users & items & ratings)? I did regression tests [here](https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing) - admittedly vs `1.6.1` at that time


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73585/testReport)** for PR 17059 at commit [`fc3acfb`](https://github.com/apache/spark/commit/fc3acfb1cd8e02f714425b79e5ef7cca4a529e26).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.

Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen: Thanks for the comments. We are getting there. :)
    
    I will handle the Long case as you suggest. 
    
    If you think people use SQL decimal types, I can include them at the end of the pattern matching. This will lead to some duplicate code though cause I need to write the same if statement twice. Any thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by imatiach-msft <gi...@git.apache.org>.

Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox I like the changes, I just had a minor concern about the code where we call v.intValue and then compare this to v.doubleValue -- due to precision issues, I'm not sure if this is desirable, since the data could come from any source and be slightly modified outside the Int range, and the previous code does not do this check


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org