You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by datumbox <gi...@git.apache.org> on 2017/02/24 23:30:34 UTC

[GitHub] spark pull request #17059: Removed unnecessary castings and refactored check...

GitHub user datumbox opened a pull request:

    https://github.com/apache/spark/pull/17059

    Removed unnecessary castings and refactored checked casts in ALS.

    ## What changes were proposed in this pull request?
    
    The original ALS was performing unnecessary casting to the user and item ids because the protected checkedCast() method required a double. I removed the castings and refactored the method to receive Any and efficiently handle all permitted numeric values.
    
    ## How was this patch tested?
    
    I tested it by running the unit-tests and by manually validating the result of checkedCast for various legal and illegal values.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/datumbox/spark als_casting_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17059.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17059
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    This one would be good for @mlnick to look at.
    
    The Scala Long will match java.lang.Number right? if so it works even if there's another conversion there, but maybe that one is trivial, and no more common a case to optimize for than others.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @MLnick @srowen @imatiach-msft Thank you for your comments and for helping making this improvement robust. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73718/testReport)** for PR 17059 at commit [`3050f6e`](https://github.com/apache/spark/commit/3050f6eeda769127196e8d1ad4b432b92af0ea7c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @MLnick Cool did not know that. Just updated commit messages. I think I'm good with this PR from my side.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @imatiach-msft This is already used in other parts of MLlib. Have a look here: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala#L311


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #3582 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3582/testReport)** for PR 17059 at commit [`98d6481`](https://github.com/apache/spark/commit/98d6481cc565b8807b6b953771cd8ac41d0c923f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox I've added some additional comments with regards to the fractional part, please take a look, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17059


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73585/testReport)** for PR 17059 at commit [`fc3acfb`](https://github.com/apache/spark/commit/fc3acfb1cd8e02f714425b79e5ef7cca4a529e26).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103547192
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    sorry, with regards to the fractional part, computations like:
    0.764553836902861+0.0701367068431045+0.165309456254034 == 1
    fail -- I strongly encourage to use a threshold for floating point numbers from IEEE-754 standard, since these values could be coming from anywhere - eg, they might have been in a sql database or some other location and when we load we can get strange errors.  For example, the following test case used to work in the previous code:
    
      test("verify can run on double values") {
        val spark = this.spark
        import spark.implicits._
    
        val als = new ALS().setMaxIter(1).setRank(1)
        val df = Seq(
          (0D, 0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 3.0),
          (0D, 0D, 2.0),
          (0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 0D, 5.0)
        ).toDF("user", "item", "rating")
        df.show()
        val model = als.fit(df)
      }
    
    with the new code it fails with error:
    
    Caused by: java.lang.IllegalArgumentException: ALS only supports values in Integer range for columns user and item. Value 0.9999999999999996 was out of Integer range.
    
    This value was very close to 1 however, and mathematically the values sum to 1: 0.764553836902861 + 0.0701367068431045 + 0.165309456254034, but due to IEEE floating point representation we can run into such issues.
    If we do insist on breaking existing users by removing the rounding, we should at least allow them an escape hatch to specify the precision.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    You are effectively handling the cases that casting handles. Doesn't a Scala long get boxed another time now? I bet it works, just wondering why that's not handled like Int. I'm neutral on this and wouldn't merge it myself but others are welcome to.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Could you explain what you mean by "duplicating"? It is safe with Scala Long; I did lots of tests to ensure that it works well. If you require any changes, I'm happy to update the PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103676000
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -205,6 +206,69 @@ class ALSSuite
         assert(decompressed.toSet === expected)
       }
     
    +  test("CheckedCast") {
    +    val checkedCast = new ALS().checkedCast
    +    val df = spark.sqlContext.range(1)
    --- End diff --
    
    This can just be `spark.range(1)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    I get it, but is that cast actually slowing things down measurably? because this change also adds some overhead in the checks it does. I think it's better to let the cast to double deal with the source nature of the number consistently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @imatiach-msft thanks for reviewing and requesting the test case, I was going to ask for one :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73697/testReport)** for PR 17059 at commit [`0b49a50`](https://github.com/apache/spark/commit/0b49a50fd071b65b0887f029a04b0a6f33a684a2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    That's compelling regarding performance. It's not big but not trivial.
    My remaining concern is whether you're handling all the cases the original did. `Number` covers a lot but does it include all the SQL decimal types? I think the common casting mechanism would cover those. Granted, it's odd if someone tries to use those values, but ideally would not work differently after this change.
    
    I am still not sure about the `Long` case -- don't you want to handle Scala Long like Int here for efficiency?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox I added an additional comment -- since you believe we should keep the current code, we should add a test case and slightly modify the error message to let the user know what other error condition they are hitting, please take a look, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen I believe that this needs to be fixed for 2 reasons:
    1. Casting the ids to double just to convert it back to integer is not an elegant solution and it is rather confusing.
    2. The double casting puts more strain on the garbage collector and I've personally measured it in an earlier version with and without the hack.
    
    Finally I do not believe the proposed fix slows down things as it does a similar number of comparisons as the original code. If you still believe this is not worth it feel free to close the PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73620/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103378267
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    Sorry, I'm not sure if this is a good idea due to floating point precision... the code above doesn't seem to do this check, it just calls toInt -- however, if this is absolutely necessary, I would hope that we could give the user some way to specify the Int range or precision.  Also, if we are going to go ahead with this change, then we should add some tests to verify the case that the exception is thrown, but without some ability to specify the precision I'm not sure if this is the correct thing to do (?).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73706/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73718/testReport)** for PR 17059 at commit [`3050f6e`](https://github.com/apache/spark/commit/3050f6eeda769127196e8d1ad4b432b92af0ea7c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @imatiach-msft Thanks mate. Show it and replied. I really don't mind it either way to be honest.
    
    @MLnick @srowen I am not sure if this PR will make it to Spark 2.2 due to the upcoming code freeze... So, there is one more tiny little thing that was always bothering me in the original implementation which is unrelated to the casting but fixing it and sending a separate PR is just embarassing... The default values of regParam do not match within the same file (lines [224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224) and [714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714). Do you mind it if I change the default of train() to 0.1 and sneak it in with this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Concerning the failed scala style test, this is caused by the long in-line comment that I added to explain the if statement. If you decide to approve the PR, I can remove it. Cheers! :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    The discussion around numeric types was at https://github.com/apache/spark/pull/12762#r61539327. The checked cast was introduced because it was previously implicitly cast to `Int` anyway, and would mangle `Long` ids without any warning. I did think about only supporting `Int` & `Long` there - but we decided to just support any numeric type within range. I guess it was around the time that we were porting most pipeline components to support all numeric input types.
    
    In hindsight I would have just gone for restricting it to `Int` & `Long`.
    
    In reality in the case of ALS users really should not be using `Double` for ids (e.g. an id of `123.4` is clearly wrong, `123.0` could occur I guess more by accident or poor schema, but in that case we support it here). So I think the likelihood of any real impact from being restrictive in terms of fractionals is really low.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    I get it, though the drawback is mostly that you've duplicated in a way the machinery that would construe any number as something comparable to the min/max int value. (Does this not miss the case of a Scala Long?) I like that it catches the case of Double values that aren't integral, but, that's quite a corner case anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged to master. Thanks @datumbox, and everyone for reviews.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73693/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: Removed unnecessary castings and refactored checked cast...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73620/testReport)** for PR 17059 at commit [`eb33189`](https://github.com/apache/spark/commit/eb331897e4a70c2d850bdfe6ef7bf9c1acb50653).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103392500
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @imatiach-msft: In this snippet we deal with Uset id and Item id. Those things should no have fractional bits. What I do here is convert the number into Integer and compare its double value. If the values are identical one of the following is true:
    - The value is Byte or Short.
    - The value is Long but within the Integer range.
    - The value is Double or Float but without any fractional part.
    
    I think this snippet is fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103547149
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    sorry, with regards to the fractional part, computations like:
    0.764553836902861+0.0701367068431045+0.165309456254034 == 1
    fail -- I strongly encourage to use a threshold for floating point numbers from IEEE-754 standard, since these values could be coming from anywhere - eg, they might have been in a sql database or some other location and when we load we can get strange errors.  For example, the following test case used to work in the previous code:
    
      test("verify can run on double values") {
        val spark = this.spark
        import spark.implicits._
    
        val als = new ALS().setMaxIter(1).setRank(1)
        val df = Seq(
          (0D, 0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 3.0),
          (0D, 0D, 2.0),
          (0.764553836902861 + 0.0701367068431045 + 0.165309456254034, 0D, 5.0)
        ).toDF("user", "item", "rating")
        df.show()
        val model = als.fit(df)
      }
    
    with the new code it fails with error:
    
    Caused by: java.lang.IllegalArgumentException: ALS only supports values in Integer range for columns user and item. Value 0.9999999999999996 was out of Integer range.
    
    This value was very close to 1 however, and mathematically the values sum to 1: 0.764553836902861 + 0.0701367068431045 + 0.165309456254034, but due to IEEE floating point representation we can run into such issues.
    If we do insist on breaking existing users by removing the rounding, we should at least allow them an escape hatch to specify the precision.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103559188
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @datumbox Sorry the link doesn't seem to work for me.  If you could add it to the ALSSuite that would be great.  Oh no, I don't mean to have a different check, just to have a clearer exception message - something like "throw new IllegalArgumentException(s"ALS only supports values in Integer range and without fractional part for columns ${$(userCol)} and ${$(itemCol)}. Value $n was either out of Integer range or contained a fractional part that could not be converted.") "


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen Yes, the behaviour of the method remains the same. This patch helped me get a measurable improvement on GC overhead in Spark 2.0, so I though that it would be beneficial for others. Anyway thanks for the comments. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen @mlnick I updated the PR based on what was discussed above and I tested it again on Spark 2.1. I also updated the coding styles and the exception message. 
    
    The changes requested by Sean, should not affect the speed of the proposed solution but rather help us catch all the corner cases. I tested this by generating a fresh batch of 1000 samples using the latest implementation ([details here](http://pastebin.com/vbbX1c7p)). The mean of the sample is 4.53837 which is no different than the mean of the code that I committed yesterday (p-value 0.4265 - H0 of equal means can't be rejected). So we are good.
    
    Could any of the admins help re-test this because the "Jenkins, retest this please" did not work for me earlier?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen The following snippet handles explicitly Longs. It can be rewritten to remove duplicate code by introducing bools for overflow detection but I don't think it is worth it. In theory you can catch also explicitly other types such as Byte and Short but I think that's an overkill.
    
    As far as I saw, all SQL numerical types inherit from Number so comparing their doubleValue with their intValue would be enough to check if they are within integer range. 
    
    ```scala
        val u = udf { (n: Any) =>
          n match {
            case v: Int => v
            case v: Long =>
              val intV = v.intValue
              if (v == intV) {
                intV
              }
              else {
                throw new IllegalArgumentException("out of range")
              }
            //case v: Byte => v.toInt 
            //case v: Short => v.toInt
            case v: Number =>
              val intV = v.intValue
              if (v.doubleValue == intV) {
                intV
              }
              else {
                throw new IllegalArgumentException("out of range")
              }
            case _ => throw new IllegalArgumentException("invalid type")
          }
        }
    ```
    
    Personally, I would remove the explicit Long case as it introduces duplicate code and does not help match. The remaining snippet avoids doing any casting if the ID is integer (which should be the majority of cases and yields the biggest memory/speed gains) or non-numeric and handles all corner cases (All scala/java numeric types + SQL Numerics). Agree?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox left a few minor comments to clean up. Overall looks good. I think it is fine for 2.2.
    
    For the regParam even though it is a minor thing it would be best to just create a new JIRA & PR for it separately.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73706/testReport)** for PR 17059 at commit [`361d9cb`](https://github.com/apache/spark/commit/361d9cb30fe87a2bd21a2476a7b549f14878df3c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103557982
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @imatiach-msft I already tested it, check this snippet posted [here](https://github.com/apache/spark/pull/17059#discussion_r103554415). I can add it in ALSSuite if necessary. The exception message is technically correct since a number with a fractional part is not in Integer range (plus you get to see the actual value of the number). Returning a different exception would require having two separate if checks instead of one. Do we really want that?
    
    Guys to be honest this PR solves a very simple thing and I did not anticipate to be so controvercial. May I suggest we agree on the final set of changes and merge it? Perhaps we can tackle any other concerns on new pull-requests? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73585/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103550608
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @imatiach-msft I am aware of the implications of floating point precision and I understand your concerns. 
    
    Having said that though, even allowing user and item Ids to be double/float is not a good idea. We just keep it for backwards compatibility I guess. Also note that the current implementation of Spark 2.1 will actually take that your 0.9999999999999996 value and silently cast it to Int (so it becomes 0)! For me the only permitted types should have been Integer, Long and BigIntegers.
    
    I don't have strong opinions about refactoring anything in the Number case as it simply performance-wise it does not matter. The point of this PR is to optimize the general case where the id is Int because the casting of the current approach algenerates twice as much data as the original dataset (of course it is GCed by at a cost). 
    
    @MLnick @srowen It's your call.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @MLnick Yeap! I have run into problems with the original implementation when dealing with several billions records. This is when removing casting starts paying off. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73697/testReport)** for PR 17059 at commit [`0b49a50`](https://github.com/apache/spark/commit/0b49a50fd071b65b0887f029a04b0a6f33a684a2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @MLnick I accepted all proposed changes. Please retest. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Ok, let me take a look at this. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox you mention there is GC & performance overhead which makes some sense. Have you run into problems with very large scale (like millions users & items & ratings)? I did regression tests [here](https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing) - admittedly vs `1.6.1` at that time


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73585/testReport)** for PR 17059 at commit [`fc3acfb`](https://github.com/apache/spark/commit/fc3acfb1cd8e02f714425b79e5ef7cca4a529e26).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen: Thanks for the comments. We are getting there. :)
    
    I will handle the Long case as you suggest. 
    
    If you think people use SQL decimal types, I can include them at the end of the pattern matching. This will lead to some duplicate code though cause I need to write the same if statement twice. Any thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox I like the changes, I just had a minor concern about the code where we call v.intValue and then compare this to v.doubleValue -- due to precision issues, I'm not sure if this is desirable, since the data could come from any source and be slightly modified outside the Int range, and the previous code does not do this check


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: Removed unnecessary castings and refactored check...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103055703
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        if (v == intV) { // True for Byte/Short, Long within the Int range and Double/Float with no fractional part.
    --- End diff --
    
    One could write this differently and explicitly handle all the permitted types. Unfortunately this would lead to duplicate code. Instead what I do here is convert the number into Integer and compare it with the original Number. If the values are identical one of the following is true:
    - The value is Byte or Short.
    - The value is Long but within the Integer range.
    - The value is Double or Float but without any fractional part.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103548602
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    Also, if we for some reason do decide to go with the check against the fractional part, which I would not recommend, we should add a test to verify the error message is thrown, and possible add this change in accepted input format to the documentation, since it may break some users (they may have to do additional conversion to int for their values prior to calling this API if they run into such floating point precision issues).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103675400
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
    --- End diff --
    
    Should probably add to comment "... out of integer range or contains a fractional part"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    I decided to provide a few more bencharks in order to alleviate some of the concerns raised by @srowen. 
    
    To reproduce the results add the following snippet in the ALSSuite class:
    ```scala
      test("Speed difference") {
        val (training, test) =
          genExplicitTestData(numUsers = 200, numItems = 400, rank = 2, noiseStd = 0.01)
    
        val runs = 1000
        var totalTime = 0.0
        println("Performing "+runs+" runs")
        println("Run Id,Time (secs)")
        for(i <- 0 until runs) {
          val t0 = System.currentTimeMillis
          testALS(training, test, maxIter = 1, rank = 2, regParam = 0.01, targetRMSE = 0.1)
          val secs = (System.currentTimeMillis - t0)/1000.0
          println(i+","+secs)
          totalTime += secs
        }
        println("AVG Execution Time: "+(totalTime/runs)+" secs")
      }
    ```
    To test both solutions, I collected 1000 samples for each (took ~1 hour for each). Here you can see the detailed output for the [original](http://pastebin.com/ys9Vejs9) and the [proposed](http://pastebin.com/dCpkyMGc) code. 
    
    | Code | Mean Execution Time | Std |
    | --- | --- | --- |
    | Original | 4.75521 | 0.81237 |
    | Proposed | 4.56276 | 0.72790 |
    
    Using an unpaired t-test to compare the two means we find that the proposed code is faster and the result is statistically significant (p-value < .0001). 
    
    Below I summarize why I believe the original code needs to change:
    1. Casting user and item ids into double and then to integer is a hacky & indirect way to validate that the ids are numerical and within integer range. The proposed code covers all the corner cases in a clear and direct way. As an added bonus, it handles Doubles and Floats with fractional part.
    2. Given that the ALS implementation requires ids with int values, it is expected that the majority of users encode their Ids as Integer. The proposed solution avoids any casting in that case while reducing the casting in all the other cases. This avoids putting unnecessary strain on the garbage collector, something that you can observe if you profile the execution on a large dataset.
    3. The proposed solution is not slower than the original; if something it is slightly faster.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #3582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3582/testReport)** for PR 17059 at commit [`98d6481`](https://github.com/apache/spark/commit/98d6481cc565b8807b6b953771cd8ac41d0c923f).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73693 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73693/testReport)** for PR 17059 at commit [`a1e32aa`](https://github.com/apache/spark/commit/a1e32aa3b600841118060cdb3a299b6569438816).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Yeah, Scala Long matches. Here is the "stand-alone" script that I used to confirm that everything works ok (tested on Spark 2.1):
    ```scala
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    
    val u = udf { (n: Any) =>
      n match {
        case v: Int => v
        case v: Number =>
          val intV = v.intValue
          if (v.doubleValue == intV) {
            intV
          }
          else {
            throw new IllegalArgumentException("out of range")
          }
        case _ => throw new IllegalArgumentException("invalid type")
      }
    }
    
    val df = sqlContext.range(10)
      .withColumn("int_success", lit(123))
      .withColumn("long_success", lit(1231L))
      .withColumn("long_fail", lit(1231000000000L))
      .withColumn("decimal_success", lit(123).cast(DecimalType(5, 2)))
      .withColumn("decimal_fail", lit(123.1).cast(DecimalType(5, 2)))
      .withColumn("double_success", lit(123.0))
      .withColumn("double_fail", lit(123.1))
      .withColumn("double_fail2", lit(1231000000000.0))
      .withColumn("string_fail", lit("123.1"))
    
    // these work fine
    df.select(u(df.col("int_success"))).show
    df.select(u(df.col("long_success"))).show
    df.select(u(df.col("decimal_success"))).show
    df.select(u(df.col("double_success"))).show
    
    // these fail with out of int range exception
    df.select(u(df.col("long_fail"))).show
    df.select(u(df.col("decimal_fail"))).show
    df.select(u(df.col("double_fail"))).show
    df.select(u(df.col("double_fail2"))).show
    
    // this fails with invalid type exception
    df.select(u(df.col("string_fail"))).show
    ```
    Cool, I'll commit the changes tonight so that @mlnick can check the final code.
    
    @srowen I really appreciate your input; you did raise good points and especially for handling the SQL datatypes. Thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103674659
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -205,6 +206,69 @@ class ALSSuite
         assert(decompressed.toSet === expected)
       }
     
    +  test("CheckedCast") {
    +    val checkedCast = new ALS().checkedCast
    +    val df = spark.sqlContext.range(1)
    +
    +    withClue("Valid Integer Ids") {
    +      df.select(checkedCast( lit(123) )).collect()
    +    }
    +
    +    withClue("Valid Long Ids") {
    +      df.select(checkedCast( lit(1231L) )).collect()
    +    }
    +
    +    withClue("Valid Decimal Ids") {
    +      df.select(checkedCast( lit(123).cast(DecimalType(15, 2)) )).collect()
    +    }
    +
    +    withClue("Valid Double Ids") {
    +      df.select(checkedCast( lit(123.0) )).collect()
    +    }
    +
    +    withClue("Invalid Long: out of range") {
    +      val e: SparkException = intercept[SparkException] {
    +        df.select(checkedCast( lit(1231000000000L) )).collect()
    +      }
    +      assert(e.getMessage.contains("either out of Integer range or contained a fractional part"))
    --- End diff --
    
    May as well factor this string out to a `val`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103674813
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -205,6 +206,69 @@ class ALSSuite
         assert(decompressed.toSet === expected)
       }
     
    +  test("CheckedCast") {
    +    val checkedCast = new ALS().checkedCast
    +    val df = spark.sqlContext.range(1)
    +
    +    withClue("Valid Integer Ids") {
    +      df.select(checkedCast( lit(123) )).collect()
    --- End diff --
    
    Could you remove the spaces around `lit(123)` here and throughout for code style?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73706/testReport)** for PR 17059 at commit [`361d9cb`](https://github.com/apache/spark/commit/361d9cb30fe87a2bd21a2476a7b549f14878df3c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox once a PR is ok'ed to test from a committer, it should automatically retest on each commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Ignore my comment about duplicate code. It can be written to avoid it. I will investigate handling the SQL decimal types as you recommended and I will update the code tonight.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73697/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103554415
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    @datumbox I agree that calling toInt in the previous code was not the best decision either; if we really wanted to support double type correctly we would have probably done the check with some precision.  Maybe if  we add to the documentation that double types for user/item should be avoided I would be ok with it.
    Also, could you add a test case to verify the exception is thrown for doubles with a fraction, similar to the test case I sent you (with a catch on the exception type and withClue on the message)?  We should clean up the error message too, since it is thrown not only when the value is not in Integer range (Int.Max/Min) but also when there is a fractional part.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73693 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73693/testReport)** for PR 17059 at commit [`a1e32aa`](https://github.com/apache/spark/commit/a1e32aa3b600841118060cdb3a299b6569438816).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    I updated the exception message and I added Unit-tests to cover the checkedCast method as @imatiach-msft suggested. I have not corrected the value of regParam but I had a look and it is safe to do so. 
    
    Thanks everyone for their comments and recommendations. Please consider merging this PR with master because many people are working lately on the ALS and soon conflicts will arise.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103675924
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -451,8 +461,8 @@ class ALS(@Since("1.4.0") override val uid: String) extends Estimator[ALSModel]
     
         val r = if ($(ratingCol) != "") col($(ratingCol)).cast(FloatType) else lit(1.0f)
         val ratings = dataset
    -      .select(checkedCast(col($(userCol)).cast(DoubleType)),
    -        checkedCast(col($(itemCol)).cast(DoubleType)), r)
    +      .select(checkedCast(col($(userCol))),
    +        checkedCast(col($(itemCol))), r)
    --- End diff --
    
    Nit - but can this now go on one line if short enough?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @imatiach-msft This comparison is intentional and checks 2 things: That the number is within integer range and that the Id does not have any non-zero digits after the decimal point. If the number is outside integer range the intValue will overflow and the number will not match its double part. Moreover if it has fractional part it will also will not match the integer.
    
    Effectively we are making sure that the UserId and ItemId have values within integer range.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73718/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox thanks for adding the tests, I've approved the review, assuming the other reviewers are ok with minimal double support.  I haven't seen withClue used in the same way you've used it, but it seems fine to me (usually the withClue would contain the error message, and you wouldn't have the extra assert).  Thank you for the nice performance improvement!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    **[Test build #73620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73620/testReport)** for PR 17059 at commit [`eb33189`](https://github.com/apache/spark/commit/eb331897e4a70c2d850bdfe6ef7bf9c1acb50653).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103677123
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -513,31 +577,31 @@ class ALSSuite
         withClue("fit should fail when ids exceed integer range. ") {
           assert(intercept[SparkException] {
             als.fit(df.select(df("user_big").as("user"), df("item"), df("rating")))
    -      }.getCause.getMessage.contains("was out of Integer range"))
    +      }.getCause.getMessage.contains("out of Integer range or contained a fractional part"))
    --- End diff --
    
    May as well clean this message up into a `val` also while we're at it :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @datumbox ah ok, thanks for the link.  LGTM!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on the issue:

    https://github.com/apache/spark/pull/17059
  
    @srowen @MLnick I had a typo on the committed version... Could you please retest it? I don't think I have the rights to do it myself.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103422515
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    I think the test does work, but may be slightly more direct to check if `v.doubleValue % 1.0 == 0.0`? this also works.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by datumbox <gi...@git.apache.org>.
Github user datumbox commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103427764
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    The way it is written it checks 2 things at the same time. 1) that it is within integer range (so if it overflows the equality will not hold) and 2) that it has no fractional part. The modulo check only covers you for point #2 but not #1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17059: [SPARK-19733][ML]Removed unnecessary castings and...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17059#discussion_r103421071
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -82,12 +82,20 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
        * Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
        * out of integer range.
        */
    -  protected val checkedCast = udf { (n: Double) =>
    -    if (n > Int.MaxValue || n < Int.MinValue) {
    -      throw new IllegalArgumentException(s"ALS only supports values in Integer range for columns " +
    -        s"${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
    -    } else {
    -      n.toInt
    +  protected val checkedCast = udf { (n: Any) =>
    +    n match {
    +      case v: Int => v // Avoid unnecessary casting
    +      case v: Number =>
    +        val intV = v.intValue()
    +        // Checks if number within Int range and has no fractional part.
    +        if (v.doubleValue == intV) {
    --- End diff --
    
    Yeah the reality is we mostly expect Int or Long ids, but want to support any numeric ID as long as it falls within Int range - so passing in fractional float/double just doesn't make sense and we'll throw an error.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org