You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2016/04/28 20:58:56 UTC

[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/12762

    [SPARK-14891][ML] Add schema validation for ALS

    This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation was performed as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown.
    
    With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue).
    
    ## How was this patch tested?
    New test cases in `ALSSuite`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark SPARK-14891-als-validate-schema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12762.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12762
    
----
commit c21d5a0afd756a023c8f1e85018f9438cdff3ad6
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-04-28T11:49:16Z

    Intial work

commit e5860cd08a18b93abecd9bbc786316517629ec91
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-04-28T13:47:45Z

    Improve tests and add transform test cases + small cleanup

commit c5f3e1a095dd80add219848f92d3c2b7667fbaa3
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-04-28T14:12:54Z

    revert small erroneous addition

commit 73ea0b62f1c0ae6a9897ec83f5c8dfedea86f3f9
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-04-28T18:47:41Z

    user/item col param docs

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-218675488
  
    **[Test build #58455 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58455/consoleFull)** for PR 12762 at commit [`150321f`](https://github.com/apache/spark/commit/150321f019eab8b26ee411d28d51847733765c34).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61494019
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -53,24 +53,43 @@ import org.apache.spark.util.random.XORShiftRandom
      */
     private[recommendation] trait ALSModelParams extends Params with HasPredictionCol {
       /**
    -   * Param for the column name for user ids.
    +   * Param for the column name for user ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    +   * fall within the integer value range.
        * Default: "user"
        * @group param
        */
    -  val userCol = new Param[String](this, "userCol", "column name for user ids")
    +  val userCol = new Param[String](this, "userCol", "column name for user ids. Must be within " +
    +    "the integer value range.")
     
       /** @group getParam */
       def getUserCol: String = $(userCol)
     
       /**
    -   * Param for the column name for item ids.
    +   * Param for the column name for item ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    --- End diff --
    
    cc @jkbradley 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61487974
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -53,24 +53,43 @@ import org.apache.spark.util.random.XORShiftRandom
      */
     private[recommendation] trait ALSModelParams extends Params with HasPredictionCol {
       /**
    -   * Param for the column name for user ids.
    +   * Param for the column name for user ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    +   * fall within the integer value range.
        * Default: "user"
        * @group param
        */
    -  val userCol = new Param[String](this, "userCol", "column name for user ids")
    +  val userCol = new Param[String](this, "userCol", "column name for user ids. Must be within " +
    +    "the integer value range.")
     
       /** @group getParam */
       def getUserCol: String = $(userCol)
     
       /**
    -   * Param for the column name for item ids.
    +   * Param for the column name for item ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    --- End diff --
    
    Ah yes, I didn't notice the first cast from input type to Long - it seems like that would be OK[ish] most of the time (except with floats/doubles), but also with certain BigDecimal you could end up throwing away the high bits when going to a Long and a very out of range value would pass the range check.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215528792
  
    cc @sethah @BenFradet @srowen @jkbradley @yanboliang 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-220017229
  
    **[Test build #58776 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58776/consoleFull)** for PR 12762 at commit [`1fd1f87`](https://github.com/apache/spark/commit/1fd1f873f0926cace39c6ee16a830c16710bc6d0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61489680
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -53,24 +53,43 @@ import org.apache.spark.util.random.XORShiftRandom
      */
     private[recommendation] trait ALSModelParams extends Params with HasPredictionCol {
       /**
    -   * Param for the column name for user ids.
    +   * Param for the column name for user ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    +   * fall within the integer value range.
        * Default: "user"
        * @group param
        */
    -  val userCol = new Param[String](this, "userCol", "column name for user ids")
    +  val userCol = new Param[String](this, "userCol", "column name for user ids. Must be within " +
    +    "the integer value range.")
     
       /** @group getParam */
       def getUserCol: String = $(userCol)
     
       /**
    -   * Param for the column name for item ids.
    +   * Param for the column name for item ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    --- End diff --
    
    Ah ok. Could cast to double or float here... I was just concerned about any
    storage / performance impact, but if everything is pipelines through the
    cast -> udf then no problem
    On Thu, 28 Apr 2016 at 21:27, Holden Karau <no...@github.com> wrote:
    
    > In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
    > <https://github.com/apache/spark/pull/12762#discussion_r61487974>:
    >
    > >
    > >    /** @group getParam */
    > >    def getUserCol: String = $(userCol)
    > >
    > >    /**
    > > -   * Param for the column name for item ids.
    > > +   * Param for the column name for item ids. Ids must be integers. Other
    > > +   * numeric types are supported for this column, but will be cast to integers as long as they
    >
    > Ah yes, I didn't notice the first cast from input type to Long - it seems
    > like that would be OK[ish] most of the time (except with floats/doubles),
    > but also with certain BigDecimal you could end up throwing away the high
    > bits when going to a Long and a very out of range value would pass the
    > range check.
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12762/files/73ea0b62f1c0ae6a9897ec83f5c8dfedea86f3f9#r61487974>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-216869120
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57759/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61539432
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -512,6 +513,60 @@ class ALSSuite
         assert(getFactors(model.userFactors) === getFactors(model2.userFactors))
         assert(getFactors(model.itemFactors) === getFactors(model2.itemFactors))
       }
    +
    +  test("input type validation") {
    +    val sqlContext = this.sqlContext
    +    import sqlContext.implicits._
    +
    +    val test = Seq()
    --- End diff --
    
    ah yes that's a left-over, will remove.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r64847897
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -430,10 +454,13 @@ class ALS(@Since("1.4.0") override val uid: String) extends Estimator[ALSModel]
     
       @Since("2.0.0")
       override def fit(dataset: Dataset[_]): ALSModel = {
    +    transformSchema(dataset.schema)
         import dataset.sparkSession.implicits._
    +
         val r = if ($(ratingCol) != "") col($(ratingCol)).cast(FloatType) else lit(1.0f)
         val ratings = dataset
    -      .select(col($(userCol)).cast(IntegerType), col($(itemCol)).cast(IntegerType), r)
    +      .select(checkedCast(col($(userCol)).cast(DoubleType)),
    --- End diff --
    
    DoubleType -> IntegerType


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r64847886
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -53,24 +53,43 @@ import org.apache.spark.util.random.XORShiftRandom
      */
     private[recommendation] trait ALSModelParams extends Params with HasPredictionCol {
       /**
    -   * Param for the column name for user ids.
    +   * Param for the column name for user ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    +   * fall within the integer value range.
        * Default: "user"
        * @group param
        */
    -  val userCol = new Param[String](this, "userCol", "column name for user ids")
    +  val userCol = new Param[String](this, "userCol", "column name for user ids. Must be within " +
    +    "the integer value range.")
     
       /** @group getParam */
       def getUserCol: String = $(userCol)
     
       /**
    -   * Param for the column name for item ids.
    +   * Param for the column name for item ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    --- End diff --
    
    Sorry for the slow answer.  I like supporting all Numeric types, with checking.  I agree we should support String IDs at some point, with automatic indexing; that can be part of this discussion: [https://issues.apache.org/jira/browse/SPARK-11106]


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-220008852
  
    **[Test build #58776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58776/consoleFull)** for PR 12762 at commit [`1fd1f87`](https://github.com/apache/spark/commit/1fd1f873f0926cace39c6ee16a830c16710bc6d0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215639037
  
    **[Test build #57319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57319/consoleFull)** for PR 12762 at commit [`fb443ef`](https://github.com/apache/spark/commit/fb443ef3a76c03ac66a6240660c9ef61581d3c1e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215529657
  
    **[Test build #57267 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57267/consoleFull)** for PR 12762 at commit [`73ea0b6`](https://github.com/apache/spark/commit/73ea0b62f1c0ae6a9897ec83f5c8dfedea86f3f9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61485103
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -53,24 +53,43 @@ import org.apache.spark.util.random.XORShiftRandom
      */
     private[recommendation] trait ALSModelParams extends Params with HasPredictionCol {
       /**
    -   * Param for the column name for user ids.
    +   * Param for the column name for user ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    +   * fall within the integer value range.
        * Default: "user"
        * @group param
        */
    -  val userCol = new Param[String](this, "userCol", "column name for user ids")
    +  val userCol = new Param[String](this, "userCol", "column name for user ids. Must be within " +
    +    "the integer value range.")
     
       /** @group getParam */
       def getUserCol: String = $(userCol)
     
       /**
    -   * Param for the column name for item ids.
    +   * Param for the column name for item ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    --- End diff --
    
    So it seems like the only other numeric type we support is Long, maybe it would be better to say that? Someone might try and pass in BigInts or Doubles and expect this work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215563782
  
    LGTM except for a few minors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-218570425
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58395/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r64868978
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -242,16 +263,19 @@ class ALSModel private[ml] (
           }
         }
         dataset
    -      .join(userFactors, dataset($(userCol)) === userFactors("id"), "left")
    -      .join(itemFactors, dataset($(itemCol)) === itemFactors("id"), "left")
    +      .join(userFactors,
    +        checkedCast(dataset($(userCol)).cast(DoubleType)) === userFactors("id"), "left")
    --- End diff --
    
    @jkbradley the existing code did the cast to `Int` - that means passing in `Long` or `Double` (say) would silently cast and potentially lose precison and give weird results, with no exception or warning. That's why here we cast to `DoubleType` and use the `checkedCast` `udf` to do a safe cast _if_ the value is within `Integer` range. If not we throw an exception with a helpful message.
    
    This is so we allow can any numeric type for the user/item columns (providing some form of "backward compatability" with the old version that didn't check types at all), but we can still only support actual _values_ that are `Int`s. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-220017380
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58776/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61491217
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -53,24 +53,43 @@ import org.apache.spark.util.random.XORShiftRandom
      */
     private[recommendation] trait ALSModelParams extends Params with HasPredictionCol {
       /**
    -   * Param for the column name for user ids.
    +   * Param for the column name for user ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    +   * fall within the integer value range.
        * Default: "user"
        * @group param
        */
    -  val userCol = new Param[String](this, "userCol", "column name for user ids")
    +  val userCol = new Param[String](this, "userCol", "column name for user ids. Must be within " +
    +    "the integer value range.")
     
       /** @group getParam */
       def getUserCol: String = $(userCol)
     
       /**
    -   * Param for the column name for item ids.
    +   * Param for the column name for item ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    --- End diff --
    
    I think keeping cast to Integer is good for performance - but maybe just avoiding supporting input types that we might silently fail on/produce junk results for.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-218682239
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58455/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61539464
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
    @@ -58,6 +58,30 @@ object MLTestingUtils extends SparkFunSuite {
           "Column label must be of type NumericType but was actually of type StringType"))
       }
     
    +  def checkNumericTypesALS[M <: Model[M], T <: Estimator[M]](
    +    estimator: T,
    +    sqlContext: SQLContext,
    +    column: String,
    +    baseType: NumericType)
    +    (check: (M, M) => Unit)
    +    (check2: (M, M, DataFrame) => Unit): Unit = {
    +    val dfs = genRatingsDFWithNumericCols(sqlContext, column)
    --- End diff --
    
    thanks - IDEA keeps resetting it. will fix that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12762


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61539561
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
    @@ -58,6 +58,30 @@ object MLTestingUtils extends SparkFunSuite {
           "Column label must be of type NumericType but was actually of type StringType"))
       }
     
    +  def checkNumericTypesALS[M <: Model[M], T <: Estimator[M]](
    --- End diff --
    
    I originally tried to adapt your generic methods to handle the ratings case, but it turned out to seem more complex than was worth it. I kept the generic types as a left-over from that attempt, but yeah it's not really necessary,


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-220129049
  
    Merged to master/branch-2.0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-218560077
  
    **[Test build #58395 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58395/consoleFull)** for PR 12762 at commit [`f0ecde7`](https://github.com/apache/spark/commit/f0ecde75098f1b4781c9ec872ed3cf0cc9431276).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-218682236
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215758720
  
    This looks good to me. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215539507
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57267/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61666720
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -53,24 +53,43 @@ import org.apache.spark.util.random.XORShiftRandom
      */
     private[recommendation] trait ALSModelParams extends Params with HasPredictionCol {
       /**
    -   * Param for the column name for user ids.
    +   * Param for the column name for user ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    +   * fall within the integer value range.
        * Default: "user"
        * @group param
        */
    -  val userCol = new Param[String](this, "userCol", "column name for user ids")
    +  val userCol = new Param[String](this, "userCol", "column name for user ids. Must be within " +
    +    "the integer value range.")
     
       /** @group getParam */
       def getUserCol: String = $(userCol)
     
       /**
    -   * Param for the column name for item ids.
    +   * Param for the column name for item ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    --- End diff --
    
    ping @mengxr also in case you have a chance to take a look, and consider this question of whether to only support Int/Long for ids or support all numeric types (with "safe" cast to Int in both cases)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215646308
  
    **[Test build #57319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57319/consoleFull)** for PR 12762 at commit [`fb443ef`](https://github.com/apache/spark/commit/fb443ef3a76c03ac66a6240660c9ef61581d3c1e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-216856811
  
    @yanboliang @holdenk @BenFradet @jkbradley I went ahead and just cast user/item col to double before checked cast to support all numeric types for user/item ids.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-219389391
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58633/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-219389390
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215646454
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57319/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-218571771
  
    LGTM, except for maybe the generics in `checkNumericTypesALS` but that's really minor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61502301
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
    @@ -58,6 +58,30 @@ object MLTestingUtils extends SparkFunSuite {
           "Column label must be of type NumericType but was actually of type StringType"))
       }
     
    +  def checkNumericTypesALS[M <: Model[M], T <: Estimator[M]](
    --- End diff --
    
    Do we want to have the generics? This seems to be specific to als only.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61485772
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -53,24 +53,43 @@ import org.apache.spark.util.random.XORShiftRandom
      */
     private[recommendation] trait ALSModelParams extends Params with HasPredictionCol {
       /**
    -   * Param for the column name for user ids.
    +   * Param for the column name for user ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    +   * fall within the integer value range.
        * Default: "user"
        * @group param
        */
    -  val userCol = new Param[String](this, "userCol", "column name for user ids")
    +  val userCol = new Param[String](this, "userCol", "column name for user ids. Must be within " +
    +    "the integer value range.")
     
       /** @group getParam */
       def getUserCol: String = $(userCol)
     
       /**
    -   * Param for the column name for item ids.
    +   * Param for the column name for item ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    --- End diff --
    
    We "support" all numeric types in the sense that the input col can be any
    numeric type. But it is cast to Int. It is a "safe" cast though, if it is >
    Int.MaxValue or < Int.MinValue it throws an exception.
    
    On Thu, 28 Apr 2016 at 21:08 Holden Karau <no...@github.com> wrote:
    
    > In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
    > <https://github.com/apache/spark/pull/12762#discussion_r61485103>:
    >
    > >
    > >    /** @group getParam */
    > >    def getUserCol: String = $(userCol)
    > >
    > >    /**
    > > -   * Param for the column name for item ids.
    > > +   * Param for the column name for item ids. Ids must be integers. Other
    > > +   * numeric types are supported for this column, but will be cast to integers as long as they
    >
    > So it seems like the only other numeric type we support is Long, maybe it
    > would be better to say that? Someone might try and pass in BigInts or
    > Doubles and expect this work.
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12762/files/73ea0b62f1c0ae6a9897ec83f5c8dfedea86f3f9#r61485103>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-216868988
  
    **[Test build #57759 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57759/consoleFull)** for PR 12762 at commit [`70a29f5`](https://github.com/apache/spark/commit/70a29f565c9a16821f747a5b747d2d8bee99b51e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215646452
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-219389308
  
    **[Test build #58633 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58633/consoleFull)** for PR 12762 at commit [`1fd1f87`](https://github.com/apache/spark/commit/1fd1f873f0926cace39c6ee16a830c16710bc6d0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-218570423
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215539378
  
    **[Test build #57267 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57267/consoleFull)** for PR 12762 at commit [`73ea0b6`](https://github.com/apache/spark/commit/73ea0b62f1c0ae6a9897ec83f5c8dfedea86f3f9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61497201
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -512,6 +513,60 @@ class ALSSuite
         assert(getFactors(model.userFactors) === getFactors(model2.userFactors))
         assert(getFactors(model.itemFactors) === getFactors(model2.itemFactors))
       }
    +
    +  test("input type validation") {
    +    val sqlContext = this.sqlContext
    +    import sqlContext.implicits._
    +
    +    val test = Seq()
    --- End diff --
    
    not sure this is used


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-216856660
  
    **[Test build #57759 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57759/consoleFull)** for PR 12762 at commit [`70a29f5`](https://github.com/apache/spark/commit/70a29f565c9a16821f747a5b747d2d8bee99b51e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r64847896
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -242,16 +263,19 @@ class ALSModel private[ml] (
           }
         }
         dataset
    -      .join(userFactors, dataset($(userCol)) === userFactors("id"), "left")
    -      .join(itemFactors, dataset($(itemCol)) === itemFactors("id"), "left")
    +      .join(userFactors,
    +        checkedCast(dataset($(userCol)).cast(DoubleType)) === userFactors("id"), "left")
    --- End diff --
    
    Shouldn't this and the next line's cast both use IntegerType?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61539327
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -53,24 +53,43 @@ import org.apache.spark.util.random.XORShiftRandom
      */
     private[recommendation] trait ALSModelParams extends Params with HasPredictionCol {
       /**
    -   * Param for the column name for user ids.
    +   * Param for the column name for user ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    +   * fall within the integer value range.
        * Default: "user"
        * @group param
        */
    -  val userCol = new Param[String](this, "userCol", "column name for user ids")
    +  val userCol = new Param[String](this, "userCol", "column name for user ids. Must be within " +
    +    "the integer value range.")
     
       /** @group getParam */
       def getUserCol: String = $(userCol)
     
       /**
    -   * Param for the column name for item ids.
    +   * Param for the column name for item ids. Ids must be integers. Other
    +   * numeric types are supported for this column, but will be cast to integers as long as they
    --- End diff --
    
    Personally as per the related JIRA, I would actually like to support Int, Long and String for ids in ALS (with appropriate warnings about performance impact for Long/String ids). For the vast majority of use cases I believe the user-friendliness of supporting String in particular outweighs the performance impact. For those users who need performance at scale, they can stick to Int.
    
    But for now, since only Int ids are supported in the DF API, some validation is better than nothing. I am actually slightly more in favor of only supporting Int or Long for the id columns in this PR, since the real-world occurrence of a Double or other more esoteric numeric type for the id column is, IMO, highly unlikely, and in that case requiring users to do the cast explicitly themselves is ok I would say.
    
    So we can support Longs (within Integer range) as a simpler alternative here - it would just require to update the type checks in `transformSchema` and the tests. 
    
    @jkbradley @srowen @holdenk thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-215539505
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r61501824
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
    @@ -58,6 +58,30 @@ object MLTestingUtils extends SparkFunSuite {
           "Column label must be of type NumericType but was actually of type StringType"))
       }
     
    +  def checkNumericTypesALS[M <: Model[M], T <: Estimator[M]](
    +    estimator: T,
    +    sqlContext: SQLContext,
    +    column: String,
    +    baseType: NumericType)
    +    (check: (M, M) => Unit)
    +    (check2: (M, M, DataFrame) => Unit): Unit = {
    +    val dfs = genRatingsDFWithNumericCols(sqlContext, column)
    --- End diff --
    
    the indent seems off for the params


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12762: [SPARK-14891][ML] Add schema validation for ALS

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12762#discussion_r103352114
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -242,16 +263,19 @@ class ALSModel private[ml] (
           }
         }
         dataset
    -      .join(userFactors, dataset($(userCol)) === userFactors("id"), "left")
    -      .join(itemFactors, dataset($(itemCol)) === itemFactors("id"), "left")
    +      .join(userFactors,
    +        checkedCast(dataset($(userCol)).cast(DoubleType)) === userFactors("id"), "left")
    --- End diff --
    
    Right, makes sense, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-219382438
  
    **[Test build #58633 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58633/consoleFull)** for PR 12762 at commit [`1fd1f87`](https://github.com/apache/spark/commit/1fd1f873f0926cace39c6ee16a830c16710bc6d0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-218682139
  
    **[Test build #58455 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58455/consoleFull)** for PR 12762 at commit [`150321f`](https://github.com/apache/spark/commit/150321f019eab8b26ee411d28d51847733765c34).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-218570242
  
    **[Test build #58395 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58395/consoleFull)** for PR 12762 at commit [`f0ecde7`](https://github.com/apache/spark/commit/f0ecde75098f1b4781c9ec872ed3cf0cc9431276).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-220017377
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-216869116
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14891][ML] Add schema validation for AL...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12762#issuecomment-220008113
  
    jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org