You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2017/05/03 19:18:58 UTC

[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/17845

    [SPARK-20587][ML] Improve performance of ML ALS recommendForAll

    This PR is a `DataFrame` version of #17742 for [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968), for improving the performance of `recommendAll` methods.
    
    ## How was this patch tested?
    
    Existing unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark ml-als-perf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17845.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17845
    
----
commit 4fd11b9eb5c21939adf1d104ea2d80de25af1542
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-03T09:36:28Z

    First cut

commit 29d67774c88dd5c86ff8c5669b0328faa842239a
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-03T10:17:00Z

    Fix increment error

commit baeadd0403434a12d3868c75c1496b562f560d59
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-03T19:05:25Z

    Move PQ outside of foreach and update comments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114706758
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -372,11 +385,45 @@ class ALSModel private[ml] (
           num: Int): DataFrame = {
         import srcFactors.sparkSession.implicits._
     
    -    val ratings = srcFactors.crossJoin(dstFactors)
    -      .select(
    -        srcFactors("id"),
    -        dstFactors("id"),
    -        predict(srcFactors("features"), dstFactors("features")))
    +    val srcFactorsBlocked = blockify(srcFactors.as[(Int, Array[Float])])
    +    val dstFactorsBlocked = blockify(dstFactors.as[(Int, Array[Float])])
    +    val ratings = srcFactorsBlocked.crossJoin(dstFactorsBlocked)
    +      .as[(Seq[(Int, Array[Float])], Seq[(Int, Array[Float])])]
    +      .flatMap { case (srcIter, dstIter) =>
    +        val m = srcIter.size
    +        val n = math.min(dstIter.size, num)
    +        val output = new Array[(Int, Int, Float)](m * n)
    +        var j = 0
    +        val pq = new BoundedPriorityQueue[(Int, Float)](num)(Ordering.by(_._2))
    +        srcIter.foreach { case (srcId, srcFactor) =>
    +          dstIter.foreach { case (dstId, dstFactor) =>
    +            /**
    --- End diff --
    
    Sounds good


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Some quick perf numbers:
    
    Using `ml-latest` dataset (~24 million ratings, ~260k users, ~39k movies); 4x workers 48 cores 100GB RAM each.
    
    `rank = 10, k = 10`
    
    master | this PR
    ------------ | -------------
    `recommendForAllUsers`
    369s | 16s
    `recommendForAllItems`
    547s | 15s
    
    So 23-37x improvement.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r115421499
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -372,11 +385,43 @@ class ALSModel private[ml] (
           num: Int): DataFrame = {
         import srcFactors.sparkSession.implicits._
     
    -    val ratings = srcFactors.crossJoin(dstFactors)
    -      .select(
    -        srcFactors("id"),
    -        dstFactors("id"),
    -        predict(srcFactors("features"), dstFactors("features")))
    +    val srcFactorsBlocked = blockify(srcFactors.as[(Int, Array[Float])])
    +    val dstFactorsBlocked = blockify(dstFactors.as[(Int, Array[Float])])
    +    val ratings = srcFactorsBlocked.crossJoin(dstFactorsBlocked)
    +      .as[(Seq[(Int, Array[Float])], Seq[(Int, Array[Float])])]
    +      .flatMap { case (srcIter, dstIter) =>
    +        val m = srcIter.size
    +        val n = math.min(dstIter.size, num)
    +        val output = new Array[(Int, Int, Float)](m * n)
    +        var j = 0
    --- End diff --
    
    True


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Merged to master/branch-2.2
    
    Thanks @mpjlu for the original work on the approach!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    cc @mpjlu 
    
    Also @srowen @sethah @jkbradley 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r115349492
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -372,11 +385,43 @@ class ALSModel private[ml] (
           num: Int): DataFrame = {
         import srcFactors.sparkSession.implicits._
     
    -    val ratings = srcFactors.crossJoin(dstFactors)
    -      .select(
    -        srcFactors("id"),
    -        dstFactors("id"),
    -        predict(srcFactors("features"), dstFactors("features")))
    +    val srcFactorsBlocked = blockify(srcFactors.as[(Int, Array[Float])])
    +    val dstFactorsBlocked = blockify(dstFactors.as[(Int, Array[Float])])
    +    val ratings = srcFactorsBlocked.crossJoin(dstFactorsBlocked)
    +      .as[(Seq[(Int, Array[Float])], Seq[(Int, Array[Float])])]
    +      .flatMap { case (srcIter, dstIter) =>
    +        val m = srcIter.size
    +        val n = math.min(dstIter.size, num)
    +        val output = new Array[(Int, Int, Float)](m * n)
    +        var j = 0
    --- End diff --
    
    Nit: You could combine j and i; you really just need 1 counter.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76424/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Thanks @sethah will update shortly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114706857
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -372,11 +385,45 @@ class ALSModel private[ml] (
           num: Int): DataFrame = {
         import srcFactors.sparkSession.implicits._
     
    -    val ratings = srcFactors.crossJoin(dstFactors)
    -      .select(
    -        srcFactors("id"),
    -        dstFactors("id"),
    -        predict(srcFactors("features"), dstFactors("features")))
    +    val srcFactorsBlocked = blockify(srcFactors.as[(Int, Array[Float])])
    +    val dstFactorsBlocked = blockify(dstFactors.as[(Int, Array[Float])])
    +    val ratings = srcFactorsBlocked.crossJoin(dstFactorsBlocked)
    +      .as[(Seq[(Int, Array[Float])], Seq[(Int, Array[Float])])]
    +      .flatMap { case (srcIter, dstIter) =>
    +        val m = srcIter.size
    +        val n = math.min(dstIter.size, num)
    +        val output = new Array[(Int, Int, Float)](m * n)
    +        var j = 0
    +        val pq = new BoundedPriorityQueue[(Int, Float)](num)(Ordering.by(_._2))
    +        srcIter.foreach { case (srcId, srcFactor) =>
    +          dstIter.foreach { case (dstId, dstFactor) =>
    +            /**
    +             * The below code is equivalent to
    +             * val score = blas.sdot(rank, srcFactor, 1, dstFactor, 1)
    +             * Compared with BLAS.dot, the hand-written version used below is more efficient than
    +             * a call to the native BLAS backend and the same performance as the fallback
    +             * F2jBLAS backend.
    +             */
    +            var score = 0.0f
    +            var k = 0
    +            while (k < rank) {
    +              score += srcFactor(k) * dstFactor(k)
    +              k += 1
    +            }
    +            pq += { (dstId, score) }
    --- End diff --
    
    sure


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17845


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76571/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114706875
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -389,6 +436,17 @@ class ALSModel private[ml] (
         )
         recs.select($"id" as srcOutputColumn, $"recommendations" cast arrayType)
    --- End diff --
    
    Fair point - may as well fix it while here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    **[Test build #76424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76424/testReport)** for PR 17845 at commit [`baeadd0`](https://github.com/apache/spark/commit/baeadd0403434a12d3868c75c1496b562f560d59).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76447/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114661114
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -372,11 +385,45 @@ class ALSModel private[ml] (
           num: Int): DataFrame = {
         import srcFactors.sparkSession.implicits._
     
    -    val ratings = srcFactors.crossJoin(dstFactors)
    -      .select(
    -        srcFactors("id"),
    -        dstFactors("id"),
    -        predict(srcFactors("features"), dstFactors("features")))
    +    val srcFactorsBlocked = blockify(srcFactors.as[(Int, Array[Float])])
    +    val dstFactorsBlocked = blockify(dstFactors.as[(Int, Array[Float])])
    +    val ratings = srcFactorsBlocked.crossJoin(dstFactorsBlocked)
    +      .as[(Seq[(Int, Array[Float])], Seq[(Int, Array[Float])])]
    +      .flatMap { case (srcIter, dstIter) =>
    +        val m = srcIter.size
    +        val n = math.min(dstIter.size, num)
    +        val output = new Array[(Int, Int, Float)](m * n)
    +        var j = 0
    +        val pq = new BoundedPriorityQueue[(Int, Float)](num)(Ordering.by(_._2))
    +        srcIter.foreach { case (srcId, srcFactor) =>
    +          dstIter.foreach { case (dstId, dstFactor) =>
    +            /**
    --- End diff --
    
    don't use doc notation. Maybe we can reduce it to:
    
    ````scala
                /*
                 * The below code is equivalent to
                 *   `val score = blas.sdot(rank, srcFactor, 1, dstFactor, 1)`
                 * The handwritten version is as or more efficient as BLAS calls in this case. 
                 */
    ````


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114706892
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -389,6 +436,17 @@ class ALSModel private[ml] (
         )
         recs.select($"id" as srcOutputColumn, $"recommendations" cast arrayType)
       }
    +
    +  /**
    +   * Blockifies factors to improve the efficiency of cross join
    +   */
    +  private def blockify(
    +    factors: Dataset[(Int, Array[Float])],
    +    /* TODO make blockSize a param? */blockSize: Int = 4096): Dataset[Seq[(Int, Array[Float])]] = {
    --- End diff --
    
    sure


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    **[Test build #76447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76447/testReport)** for PR 17845 at commit [`cf35eea`](https://github.com/apache/spark/commit/cf35eead9ce0d4832a0df5de22722b47e4017bf7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r115425530
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -372,11 +385,43 @@ class ALSModel private[ml] (
           num: Int): DataFrame = {
         import srcFactors.sparkSession.implicits._
     
    -    val ratings = srcFactors.crossJoin(dstFactors)
    -      .select(
    -        srcFactors("id"),
    -        dstFactors("id"),
    -        predict(srcFactors("features"), dstFactors("features")))
    +    val srcFactorsBlocked = blockify(srcFactors.as[(Int, Array[Float])])
    +    val dstFactorsBlocked = blockify(dstFactors.as[(Int, Array[Float])])
    +    val ratings = srcFactorsBlocked.crossJoin(dstFactorsBlocked)
    +      .as[(Seq[(Int, Array[Float])], Seq[(Int, Array[Float])])]
    +      .flatMap { case (srcIter, dstIter) =>
    +        val m = srcIter.size
    +        val n = math.min(dstIter.size, num)
    +        val output = new Array[(Int, Int, Float)](m * n)
    +        var j = 0
    --- End diff --
    
    Anyway the `iter.next()` code is a bit ugly and since it's at most `k` elements it's not really performance critical, so could just use `foreach` I think


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114658306
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -372,11 +385,45 @@ class ALSModel private[ml] (
           num: Int): DataFrame = {
         import srcFactors.sparkSession.implicits._
     
    -    val ratings = srcFactors.crossJoin(dstFactors)
    -      .select(
    -        srcFactors("id"),
    -        dstFactors("id"),
    -        predict(srcFactors("features"), dstFactors("features")))
    +    val srcFactorsBlocked = blockify(srcFactors.as[(Int, Array[Float])])
    +    val dstFactorsBlocked = blockify(dstFactors.as[(Int, Array[Float])])
    +    val ratings = srcFactorsBlocked.crossJoin(dstFactorsBlocked)
    +      .as[(Seq[(Int, Array[Float])], Seq[(Int, Array[Float])])]
    +      .flatMap { case (srcIter, dstIter) =>
    +        val m = srcIter.size
    +        val n = math.min(dstIter.size, num)
    +        val output = new Array[(Int, Int, Float)](m * n)
    +        var j = 0
    +        val pq = new BoundedPriorityQueue[(Int, Float)](num)(Ordering.by(_._2))
    +        srcIter.foreach { case (srcId, srcFactor) =>
    +          dstIter.foreach { case (dstId, dstFactor) =>
    +            /**
    +             * The below code is equivalent to
    +             * val score = blas.sdot(rank, srcFactor, 1, dstFactor, 1)
    +             * Compared with BLAS.dot, the hand-written version used below is more efficient than
    +             * a call to the native BLAS backend and the same performance as the fallback
    +             * F2jBLAS backend.
    +             */
    +            var score = 0.0f
    +            var k = 0
    +            while (k < rank) {
    +              score += srcFactor(k) * dstFactor(k)
    +              k += 1
    +            }
    +            pq += { (dstId, score) }
    --- End diff --
    
    `pq += dstId -> score`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114706687
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -356,6 +356,19 @@ class ALSModel private[ml] (
     
       /**
        * Makes recommendations for all users (or items).
    +   *
    +   * Note: the previous approach used for computing top-k recommendations
    +   * used a cross-join followed by predicting a score for each row of the joined dataset.
    +   * However, this results in exploding the size of intermediate data. While Spark SQL makes it
    +   * relatively efficient, the approach implemented here is significantly more efficient.
    +   *
    +   * This approach groups factors into blocks and computes the top-k elements per block,
    +   * using Level 1 BLAS (dot) and an efficient [[BoundedPriorityQueue]]. It then computes the
    --- End diff --
    
    How about "... using dot product instead of gemm and an efficient ..."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114660366
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -356,6 +356,19 @@ class ALSModel private[ml] (
     
       /**
        * Makes recommendations for all users (or items).
    +   *
    +   * Note: the previous approach used for computing top-k recommendations
    +   * used a cross-join followed by predicting a score for each row of the joined dataset.
    +   * However, this results in exploding the size of intermediate data. While Spark SQL makes it
    +   * relatively efficient, the approach implemented here is significantly more efficient.
    +   *
    +   * This approach groups factors into blocks and computes the top-k elements per block,
    +   * using Level 1 BLAS (dot) and an efficient [[BoundedPriorityQueue]]. It then computes the
    --- End diff --
    
    below we say that blas is _not_ used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    Actually, I do have some questions, but I see this is copied from the other PR, so I'll ask there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    One more comment I'll copy from the other PR: I'm not a fan of custom BLAS implementations scattered throughout MLlib. Could you please follow up by putting the dot as a private API in BLAS.scala and adding unit tests?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114660256
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -389,6 +436,17 @@ class ALSModel private[ml] (
         )
         recs.select($"id" as srcOutputColumn, $"recommendations" cast arrayType)
       }
    +
    +  /**
    +   * Blockifies factors to improve the efficiency of cross join
    +   */
    +  private def blockify(
    +    factors: Dataset[(Int, Array[Float])],
    +    /* TODO make blockSize a param? */blockSize: Int = 4096): Dataset[Seq[(Int, Array[Float])]] = {
    --- End diff --
    
    just put the comment in the doc and reference a JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    **[Test build #76424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76424/testReport)** for PR 17845 at commit [`baeadd0`](https://github.com/apache/spark/commit/baeadd0403434a12d3868c75c1496b562f560d59).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    **[Test build #76447 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76447/testReport)** for PR 17845 at commit [`cf35eea`](https://github.com/apache/spark/commit/cf35eead9ce0d4832a0df5de22722b47e4017bf7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r115424735
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -372,11 +385,43 @@ class ALSModel private[ml] (
           num: Int): DataFrame = {
         import srcFactors.sparkSession.implicits._
     
    -    val ratings = srcFactors.crossJoin(dstFactors)
    -      .select(
    -        srcFactors("id"),
    -        dstFactors("id"),
    -        predict(srcFactors("features"), dstFactors("features")))
    +    val srcFactorsBlocked = blockify(srcFactors.as[(Int, Array[Float])])
    +    val dstFactorsBlocked = blockify(dstFactors.as[(Int, Array[Float])])
    +    val ratings = srcFactorsBlocked.crossJoin(dstFactorsBlocked)
    +      .as[(Seq[(Int, Array[Float])], Seq[(Int, Array[Float])])]
    +      .flatMap { case (srcIter, dstIter) =>
    +        val m = srcIter.size
    +        val n = math.min(dstIter.size, num)
    +        val output = new Array[(Int, Int, Float)](m * n)
    +        var j = 0
    --- End diff --
    
    `j` iterates through `src` ids while `i` iterates through `dst` ids in the queue for each `src` id. So I don't think they can be combined.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17845: [SPARK-20587][ML] Improve performance of ML ALS r...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17845#discussion_r114660148
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -389,6 +436,17 @@ class ALSModel private[ml] (
         )
         recs.select($"id" as srcOutputColumn, $"recommendations" cast arrayType)
    --- End diff --
    
    This is discouraged within Spark: https://github.com/databricks/scala-style-guide#infix-methods


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    **[Test build #76571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76571/testReport)** for PR 17845 at commit [`cf35eea`](https://github.com/apache/spark/commit/cf35eead9ce0d4832a0df5de22722b47e4017bf7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17845: [SPARK-20587][ML] Improve performance of ML ALS recommen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17845
  
    **[Test build #76571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76571/testReport)** for PR 17845 at commit [`cf35eea`](https://github.com/apache/spark/commit/cf35eead9ce0d4832a0df5de22722b47e4017bf7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org