You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by gaborhermann <gi...@git.apache.org> on 2016/09/23 11:11:24 UTC

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

GitHub user gaborhermann opened a pull request:

    https://github.com/apache/flink/pull/2542

    [FLINK-4613] Extend ALS to handle implicit feedback datasets

    This extension of the ALS algorithm changes some parts of the code if `implicitPrefs` flag is set to true. Mainly the local parts parts are changed: the `Xt * X` computation takes into consideration the confidence, thus computing `Xt * (C - I) * X` instead (see the paper by Hu et al. for details). The `Xt * X` matrix is precomputed and broadcasted, and that is the only thing that affects distributed execution.
    
    Note, that we use a temporary directory in the test, because there would not be enough memory segments to perform a hash join for prediction. I assume that memory segments are not freed up after the training if no temporary directory is set, but I did not investigate the issue as using a tempdir is a simple workaround.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gaborhermann/flink ials

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2542.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2542
    
----
commit 84d338b11f77b20fa1825029f8ca847a40eb4673
Author: G�bor Hermann <co...@gaborhermann.com>
Date:   2016-09-12T09:47:40Z

    [FLINK-4613] Compute XtX for IALS & test, docs

commit 8e7c0d67a6f0390f03765fcdc9e03f3c391807cd
Author: jfeher <fe...@gmail.com>
Date:   2016-09-12T09:57:44Z

    [FLINK-4613] Extend ALS for implicit case
    
    XtX matrix precomputation is not yet done.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    I did some minor changes according to user suggestions, and put it in a separate commit (Minor fixes in IALS), so you can see my fixes. Note that before merging I'd prefer to squash the commit with the previous one in order to have proper commits.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80253321
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -49,6 +49,18 @@ By applying this step alternately to the matrices $U$ and $V$, we can iterativel
     
     The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
     
    +An alternative model can be used for _implicit feedback_ datasets.
    +These datasets only contain implicit feedback from the user
    +in contrast to datasets with explicit feedback like movie ratings.
    +For example users watch videos on a website and the website monitors which user
    +viewed which video, so the users only provide their preference implicitly.
    +In these cases the feedback should not be treated as a
    +rating, but rather an evidence that the user prefers that item.
    +Thus, for implicit feedback datasets there is a slightly different
    +minimalization problem to solve (see [Hu et al.](http://dx.doi.org/10.1109/ICDM.2008.22) for details).
    +Flink supports both explicit and implicit ALS,
    +and the choice between the two can be set in the parameters.
    +
    --- End diff --
    
    Okay, I added
    "The implementation is based on the Apache Spark implementation of implicit ALS."
    and referred to the relevant file in the Spark codebase.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by zentol <gi...@git.apache.org>.
Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80229781
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -49,6 +49,18 @@ By applying this step alternately to the matrices $U$ and $V$, we can iterativel
     
     The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
     
    +An alternative model can be used for _implicit feedback_ datasets.
    +These datasets only contain implicit feedback from the user
    +in constrast to datasets with explicit feedback like movie ratings.
    --- End diff --
    
    constrast -> contrast


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    For tests it would be good to fix the random seed at least. Otherwise we might not have stable tests depending on the seed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80230683
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -49,6 +49,18 @@ By applying this step alternately to the matrices $U$ and $V$, we can iterativel
     
     The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
     
    +An alternative model can be used for _implicit feedback_ datasets.
    +These datasets only contain implicit feedback from the user
    +in constrast to datasets with explicit feedback like movie ratings.
    --- End diff --
    
    Fixed. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Yes I think you're right. We shouldn't fix the seed per default. I think you can include this change in this PR since it's a minor change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r81159171
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -581,6 +637,16 @@ object ALS {
             val userXy = new ArrayBuffer[Array[Double]]()
             val numRatings = new ArrayBuffer[Int]()
     
    +        var precomputedXtX: Array[Double] = null
    +
    +        override def open(config: Configuration): Unit = {
    +          // retrieve broadcasted precomputed XtX if using implicit feedback
    +          if (implicitPrefs) {
    +            precomputedXtX = getRuntimeContext.getBroadcastVariable[Array[Double]]("XtX")
    +              .iterator().next()
    +          }
    +        }
    +
             override def coGroup(left: lang.Iterable[(Int, Int, Array[Array[Double]])],
    --- End diff --
    
    We can ping @tillrohrmann here, as the original author maybe he has some input.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r81112771
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -99,6 +114,26 @@ The alternating least squares implementation can be controlled by the following
             </td>
           </tr>
           <tr>
    +        <td><strong>ImplicitPrefs</strong></td>
    +        <td>
    +          <p>
    +            Implicit property of the observations, meaning that they do not represent an explicit
    +            preference of the user, just the implicit information how many times the user consumed the
    +            (Default value: <strong>false</strong>)
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>Alpha</strong></td>
    +        <td>
    +          <p>
    +            Weight of the positive implicit observations. Should be non-negative.
    +            Only relevant when ImplicitPrefs is set to true.
    +            (Default value: <strong>1</strong>)
    --- End diff --
    
    I guess it was left as default 1 for no reason, I don't have any motivation for it. I changed it to 40 according to the paper. Thanks for the suggestion!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r81112516
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -49,6 +49,21 @@ By applying this step alternately to the matrices $U$ and $V$, we can iterativel
     
     The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
     
    +An alternative model can be used for _implicit feedback_ datasets.
    +These datasets only contain implicit feedback from the user
    +in contrast to datasets with explicit feedback like movie ratings.
    +For example users watch videos on a website and the website monitors which user
    +viewed which video, so the users only provide their preference implicitly.
    +In these cases the feedback should not be treated as a
    +rating, but rather an evidence that the user prefers that item.
    +Thus, for implicit feedback datasets there is a slightly different
    +minimalization problem to solve (see [Hu et al.](http://dx.doi.org/10.1109/ICDM.2008.22) for details).
    --- End diff --
    
    Thanks. Changed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87588061
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -675,7 +756,69 @@ object ALS {
               collector.collect((blockID, array))
             }
           }
    -    }.withForwardedFieldsFirst("0").withForwardedFieldsSecond("0")
    +    }
    +
    +    // broadcasting XtX matrix in the implicit case
    +    val updatedFactorMatrix = if (implicitPrefs) {
    +      newMatrix.withBroadcastSet(XtXtoBroadcast.get, "XtX")
    +    } else {
    +      newMatrix
    +    }
    +
    +    updatedFactorMatrix.withForwardedFieldsFirst("0").withForwardedFieldsSecond("0")
    +  }
    +
    +  /**
    +    * Computes the XtX matrix for the implicit version before updating the factors.
    +    * This matrix is intended to be broadcast, but as we cannot use a sink inside a Flink
    +    * iteration, so we represent it as a [[DataSet]] with a single element containing the matrix.
    +    *
    +    * The algorithm computes `X_i^T * X_i` for every block `X_i` of `X`,
    +    * then sums all these computed matrices to get `X^T * X`.
    +    */
    +  private[recommendation] def computeXtX(x: DataSet[(Int, Array[Array[Double]])], factors: Int):
    +  DataSet[Array[Double]] = {
    +    val triangleSize = factors * (factors - 1) / 2 + factors
    +
    +    type MtxBlock = (Int, Array[Array[Double]])
    +    // construct XtX for all blocks
    +    val xtx = x
    +      .mapPartition(new MapPartitionFunction[MtxBlock, Array[Double]]() {
    +        var xtxForBlock: Array[Double] = null
    +
    +        override def mapPartition(blocks: Iterable[(Int, Array[Array[Double]])],
    +                                  out: Collector[Array[Double]]): Unit = {
    +
    +          if (xtxForBlock == null) {
    +            // creating the matrix if not yet created
    +            xtxForBlock = Array.fill(triangleSize)(0.0)
    +          } else {
    +            // erasing the matrix
    +            var i = 0
    +            while (i < xtxForBlock.length) {
    --- End diff --
    
    Okay, you're right. I modified the code accordingly.
    
    In this case the decision seems straightforward, but when it's not that clear, I agree we should do profiling. Then we could see if we could make the profiling easy to use. For now let's just keep this in mind.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by mbalassi <gi...@git.apache.org>.
Github user mbalassi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80248596
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -49,6 +49,18 @@ By applying this step alternately to the matrices $U$ and $V$, we can iterativel
     
     The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
     
    +An alternative model can be used for _implicit feedback_ datasets.
    +These datasets only contain implicit feedback from the user
    +in contrast to datasets with explicit feedback like movie ratings.
    +For example users watch videos on a website and the website monitors which user
    +viewed which video, so the users only provide their preference implicitly.
    +In these cases the feedback should not be treated as a
    +rating, but rather an evidence that the user prefers that item.
    +Thus, for implicit feedback datasets there is a slightly different
    +minimalization problem to solve (see [Hu et al.](http://dx.doi.org/10.1109/ICDM.2008.22) for details).
    +Flink supports both explicit and implicit ALS,
    +and the choice between the two can be set in the parameters.
    +
    --- End diff --
    
    Let us mention, that the implementation was motivated by the Spark implicit ALS implementation. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r81115229
  
    --- Diff: flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/recommendation/ImplicitALSTest.scala ---
    @@ -0,0 +1,171 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.ml.recommendation
    +
    +import org.apache.flink.ml.util.FlinkTestBase
    +import org.scalatest._
    +
    +import scala.language.postfixOps
    +import org.apache.flink.api.scala._
    +import org.apache.flink.core.testutils.CommonTestUtils
    +
    +class ImplicitALSTest
    +  extends FlatSpec
    +    with Matchers
    +    with FlinkTestBase {
    +
    +  override val parallelism = 2
    +
    +  behavior of "The modification of the alternating least squares (ALS) implementation" +
    +    "for implicit feedback datasets."
    +
    +  it should "properly compute Y^T * Y, and factorize matrix" in {
    --- End diff --
    
    Yes, you are right it should be split into two tests.
    I kept it in one only because, if I remembered correctly, there was a rule not to initialize to `ExecutionEnvironment`s in one test (probably because of taking test performance into consideration). I did not see, however, how to initialize the environment once, as I would have had to override the `before` function implemented in `FlinkTestBase`.
    
    What do you think would be the best solution? Overriding `before`, using two different `ExectionEnvironment`s, or keeping it this way (in one test)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87587473
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -273,6 +308,14 @@ object ALS {
         val defaultValue: Option[Int] = Some(10)
       }
     
    +  case object ImplicitPrefs extends Parameter[Boolean] {
    --- End diff --
    
    I suggest to leave it at 10 and do not give any advise to use highest possible number. It should work out-of-the-box, and to get good results, the user must fit the parameters to the use-case (such as the lambda coefficient of the regularization). Btw. Spark uses 10 by default.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r81112526
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -99,6 +114,26 @@ The alternating least squares implementation can be controlled by the following
             </td>
           </tr>
           <tr>
    +        <td><strong>ImplicitPrefs</strong></td>
    +        <td>
    +          <p>
    +            Implicit property of the observations, meaning that they do not represent an explicit
    +            preference of the user, just the implicit information how many times the user consumed the
    --- End diff --
    
    Thanks. Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87355415
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -675,7 +756,69 @@ object ALS {
               collector.collect((blockID, array))
             }
           }
    -    }.withForwardedFieldsFirst("0").withForwardedFieldsSecond("0")
    +    }
    +
    +    // broadcasting XtX matrix in the implicit case
    +    val updatedFactorMatrix = if (implicitPrefs) {
    +      newMatrix.withBroadcastSet(XtXtoBroadcast.get, "XtX")
    +    } else {
    +      newMatrix
    +    }
    +
    +    updatedFactorMatrix.withForwardedFieldsFirst("0").withForwardedFieldsSecond("0")
    +  }
    +
    +  /**
    +    * Computes the XtX matrix for the implicit version before updating the factors.
    +    * This matrix is intended to be broadcast, but as we cannot use a sink inside a Flink
    +    * iteration, so we represent it as a [[DataSet]] with a single element containing the matrix.
    +    *
    +    * The algorithm computes `X_i^T * X_i` for every block `X_i` of `X`,
    +    * then sums all these computed matrices to get `X^T * X`.
    +    */
    +  private[recommendation] def computeXtX(x: DataSet[(Int, Array[Array[Double]])], factors: Int):
    +  DataSet[Array[Double]] = {
    +    val triangleSize = factors * (factors - 1) / 2 + factors
    +
    +    type MtxBlock = (Int, Array[Array[Double]])
    +    // construct XtX for all blocks
    +    val xtx = x
    +      .mapPartition(new MapPartitionFunction[MtxBlock, Array[Double]]() {
    +        var xtxForBlock: Array[Double] = null
    +
    +        override def mapPartition(blocks: Iterable[(Int, Array[Array[Double]])],
    +                                  out: Collector[Array[Double]]): Unit = {
    +
    +          if (xtxForBlock == null) {
    +            // creating the matrix if not yet created
    +            xtxForBlock = Array.fill(triangleSize)(0.0)
    +          } else {
    +            // erasing the matrix
    +            var i = 0
    +            while (i < xtxForBlock.length) {
    --- End diff --
    
    I tried to avoid object creation, but I'm not sure if erasing works as well. By using `fill` a new `factors * factors` matrix would be created at every mapping. Am I right? Maybe that's not a big problem, as there is only one mapping for every partition, and the matrix is not that big. Maybe it was just premature optimization :) We could use fill, because that make the code cleaner. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Hello @gaborhermann,
    
    Yes I think you are right in that respect, just wanted to note that we should perform some comparative benchmarks in the future.
    
    So the benchmarks look good IMHO, we now need to address the couple of comments I had, namely splitting up the tests and deciphering why a `java.Iterable` was used in that spot if possible.
    
    I was also wondering: For the expected results in the test, where did you get the reference data?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80860320
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -99,6 +114,26 @@ The alternating least squares implementation can be controlled by the following
             </td>
           </tr>
           <tr>
    +        <td><strong>ImplicitPrefs</strong></td>
    +        <td>
    +          <p>
    +            Implicit property of the observations, meaning that they do not represent an explicit
    +            preference of the user, just the implicit information how many times the user consumed the
    --- End diff --
    
    Missing word at the end "consumed the ???". Would also change "explicit preference" to "explicit rating from the user".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80862241
  
    --- Diff: flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/recommendation/ImplicitALSTest.scala ---
    @@ -0,0 +1,171 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.ml.recommendation
    +
    +import org.apache.flink.ml.util.FlinkTestBase
    +import org.scalatest._
    +
    +import scala.language.postfixOps
    +import org.apache.flink.api.scala._
    +import org.apache.flink.core.testutils.CommonTestUtils
    +
    +class ImplicitALSTest
    +  extends FlatSpec
    +    with Matchers
    +    with FlinkTestBase {
    +
    +  override val parallelism = 2
    +
    +  behavior of "The modification of the alternating least squares (ALS) implementation" +
    +    "for implicit feedback datasets."
    +
    +  it should "properly compute Y^T * Y, and factorize matrix" in {
    +    import ExampleMatrix._
    +
    +    val rand = scala.util.Random
    +    val numBlocks = 3
    +    // randomly split matrix to blocks
    +    val blocksY = Y
    +      // add a random block id to every row
    +      .map { row =>
    +        (rand.nextInt(numBlocks), row)
    +      }
    +      // get the block via grouping
    +      .groupBy(_._1).values
    +      // add a block id (-1) to each block
    +      .map(b => (-1, b.map(_._2)))
    +      .toSeq
    +
    +    // use Flink to compute YtY
    +    val env = ExecutionEnvironment.getExecutionEnvironment
    +
    +    val distribBlocksY = env.fromCollection(blocksY)
    +
    +    val YtY = ALS
    +      .computeXtX(distribBlocksY, factors)
    +      .collect().head
    +
    +    // check YtY size
    +    YtY.length should be (factors * (factors - 1) / 2 + factors)
    +
    +    // check result is as expected
    +    expectedUpperTriangleYtY
    +      .zip(YtY)
    +      .foreach { case (expected, result) =>
    +        result should be (expected +- 0.1)
    +      }
    +
    +    // temporary directory to avoid too few memory segments
    +    val tempDir = CommonTestUtils.getTempDir + "/"
    +
    +    // factorize matrix with implicit ALS
    +    val als = ALS()
    +      .setIterations(iterations)
    +      .setLambda(lambda)
    +      .setBlocks(blocks)
    +      .setNumFactors(factors)
    +      .setImplicit(true)
    +      .setAlpha(alpha)
    +      .setSeed(seed)
    +      .setTemporaryPath(tempDir)
    +
    +    val inputDS = env.fromCollection(implicitRatings)
    +
    +    als.fit(inputDS)
    +
    +    // check predictions on some user-item pairs
    +    val testData = env.fromCollection(expectedResult.map{
    +      case (userID, itemID, rating) => (userID, itemID)
    +    })
    +
    +    val predictions = als.predict(testData).collect()
    +
    +    predictions.length should equal(expectedResult.length)
    +
    +    val resultMap = expectedResult map {
    +      case (uID, iID, value) => (uID, iID) -> value
    +    } toMap
    +
    +    predictions foreach {
    +      case (uID, iID, value) => {
    +        resultMap.isDefinedAt((uID, iID)) should be(true)
    +
    +        value should be(resultMap((uID, iID)) +- 1e-5)
    +      }
    +    }
    +
    +  }
    +
    +}
    +
    +object ExampleMatrix {
    --- End diff --
    
    Data should go to the `Recommendation.scala` file, as with the plain ALS matrix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r81159482
  
    --- Diff: flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/recommendation/ImplicitALSTest.scala ---
    @@ -0,0 +1,171 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.ml.recommendation
    +
    +import org.apache.flink.ml.util.FlinkTestBase
    +import org.scalatest._
    +
    +import scala.language.postfixOps
    +import org.apache.flink.api.scala._
    +import org.apache.flink.core.testutils.CommonTestUtils
    +
    +class ImplicitALSTest
    +  extends FlatSpec
    +    with Matchers
    +    with FlinkTestBase {
    +
    +  override val parallelism = 2
    +
    +  behavior of "The modification of the alternating least squares (ALS) implementation" +
    +    "for implicit feedback datasets."
    +
    +  it should "properly compute Y^T * Y, and factorize matrix" in {
    --- End diff --
    
    AFAIK in the rest of the FlinkML tests we just use `val env = ExecutionEnvironment.getExecutionEnvironment`. I don't know if that policy has now changed, maybe @tillrohrmann can clarify.
    
    For now I would say to just split the tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Okay. Thank you again @thvasilo for reviewing our code!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87195483
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -156,6 +171,26 @@ class ALS extends Predictor[ALS] {
         this
       }
     
    +  /** Sets the input observations to be implicit, thus using the iALS algorithm for learning.
    --- End diff --
    
    The docstring is not worded correctly, as the passed argument could be true or false.
    
    Should be prefixed with something like "When set to true, we assume implicit observations..."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by mbalassi <gi...@git.apache.org>.
Github user mbalassi commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    @gaborhermann @thvasilo I would definitely like to see a test on a larger dataset, that is actually what I was asking for when I mentioned "benchmark", maybe I was not clear then.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87353990
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -273,6 +308,14 @@ object ALS {
         val defaultValue: Option[Int] = Some(10)
       }
     
    +  case object ImplicitPrefs extends Parameter[Boolean] {
    --- End diff --
    
    I also changed this, but note that this is for the explicit ALS algorithm too. Do you think it's okay to give this recommendation for explicit case too? Really high number of factors (without regularization) might lead to overfitting.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    @gaborhermann Yup the approach taken by the Spark community for testing is closer to what we would like to have for non-deterministic algorithms, but what you have implemented now should suffice on the assumption that the ALS implementation is correct.
    
    @tillrohrmann Initially implemented ALS so I'm not sure how he arrived at the expected results. It would be a good idea for the future to document how we generate test data so it's easy to replicate and validate the process. That should be enough for deterministic algorithms, and for non-deterministic we should have proxies like measuring the error of reconstruction etc.
    
    I'll take a look at the code again now, and will add comments if I find something. Otherwise I hope @mbalassi can find some time to review and merge if no objections come up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Thank you @thvasilo for your thorough review :)
    
    @mbalassi I think this PR is ready to merge. Could you do a review when you have some time?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87422110
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -273,6 +308,14 @@ object ALS {
         val defaultValue: Option[Int] = Some(10)
       }
     
    +  case object ImplicitPrefs extends Parameter[Boolean] {
    --- End diff --
    
    You are correct the recommendation is from the iALS paper, but I'm not sure if the same holds for ALS. I trust your judgment here, since I'm not as familiar with xALS as I should be to have a good intuition about this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Thank you @jfeher!
    
    Could you clarify what you mean by filtering the data to get unique item-user pairs? Is this because the iALS algorithm only supports binary interactions (i.e. number of interactions does not play a role)?
    
    Any idea how these runtime numbers compare to alternative implementations (Spark, [Cython](https://github.com/benfred/implicit))?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Okay, I've made the seed optional. I also removed the `generateRandomMatrix` function as it wasn't used at all.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80860723
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -99,6 +114,26 @@ The alternating least squares implementation can be controlled by the following
             </td>
           </tr>
           <tr>
    +        <td><strong>ImplicitPrefs</strong></td>
    +        <td>
    +          <p>
    +            Implicit property of the observations, meaning that they do not represent an explicit
    +            preference of the user, just the implicit information how many times the user consumed the
    +            (Default value: <strong>false</strong>)
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>Alpha</strong></td>
    +        <td>
    +          <p>
    +            Weight of the positive implicit observations. Should be non-negative.
    +            Only relevant when ImplicitPrefs is set to true.
    +            (Default value: <strong>1</strong>)
    --- End diff --
    
    Can you provide some motivation for this default value? From the paper I see:
    
    > In our experiments, setting \u03b1 = 40 was found to produce good results.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Hi @thvasilo,
    
    Thanks for your thoughts! I agree we should perform a benchmark in the future. Furthermore, based on the results we could optimize the algorithm.
    
    I split up the test, and rebased to the current master. I checked the `java.Iterable` again, and commented at your original concern. I am afraid we'll have to use the `java.Iterable`.
    
    Regarding the expected results, I've only generated the small input data by hand. Before that I checked whether the Spark and Flink implementations converged to approximately same factor matrices (I only checked the value of the objective function, not the whole matrices). Because of the random initialization we cannot guarantee to have the same results, but there were 2-3 points that both Spark and Flink converged to.
    
    There might be better methods for testing, but I considered this sufficient as the original `ALSITSuite` did nothing more. Of course, this test only checks whether the algorithm works the same way after some modifications (e.g. optimization), and does not check whether the algorithm initially works or not, but it's the same case with the original ALS. Do you know what is the assurance for the explicit ALS working good? (It must be good, as I also checked the results of the explicit ALS against Spark on toy-data.) AFAIK Spark generates random matrices of known rank, factorizes them, and checks whether the error is low (see their [ALSSuite](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/recommendation/ALSSuite.scala)). In the future, it might be worth to follow their approach.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r81114406
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -581,6 +637,16 @@ object ALS {
             val userXy = new ArrayBuffer[Array[Double]]()
             val numRatings = new ArrayBuffer[Int]()
     
    +        var precomputedXtX: Array[Double] = null
    +
    +        override def open(config: Configuration): Unit = {
    +          // retrieve broadcasted precomputed XtX if using implicit feedback
    +          if (implicitPrefs) {
    +            precomputedXtX = getRuntimeContext.getBroadcastVariable[Array[Double]]("XtX")
    +              .iterator().next()
    +          }
    +        }
    +
             override def coGroup(left: lang.Iterable[(Int, Int, Array[Array[Double]])],
    --- End diff --
    
    If I see it right, I did not change this line, it was in the original ALS implementation. However, I can't find any reason to use the Java `Iterable`.
    
    There could be other minor things to refactor in the original ALS code, but I preferred to keep them as they were, not to break anything. Should I refactor some parts along the way when I extend an algorithm like this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    @tillrohrmann thanks for clarifying!
    Beside testing against Spark, I also did a small NumPy implementation, to make sure iALS works well. Although, I did not make efforts to imitate the initial random factor generation of Flink, so the results cannot be exactly the same. I am also happy to share that if needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    @thvasilo you're right in the future we should make sure that we properly document how we came up with the testing data.
    
    What I've done to come up with the testing data for the `ALSITSuite` is to factorize the given recommendation matrix with Matlab with the given parameters (factors, iterations, etc). I've implemented ALS in Matlab for that. Ideally, we use a library for that so that others can reproduce the test results. I can share the ALS implementation if you like.
    
    Ideally we do something similar for iALS.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] Extend ALS to handle implicit feedback datas...

Posted by mbalassi <gi...@git.apache.org>.
Github user mbalassi commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    @gaborhermann @jfeher Could you share results where you benchmarked the algorithm against other implementations, please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by mbalassi <gi...@git.apache.org>.
Github user mbalassi commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Yes @gaborhermann , I finally got here. \U0001f604 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] Extend ALS to handle implicit feedback datas...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Hello @gaborhermann thank you for your contribution! 
    Are the numbers here non-zero entries in a matrix?
    If that is the case do you think it would be possible to test this on some larger scale datasets?
    
    This would bring it closer to actual use cases someone using Flink might have.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    I agree. I already fixed the seed at the `ImplicitALSITSuite`. At the `ALSITSuite` the seed is unset, but by default it's 0, that's why `ALSITSuite` is deterministic, so I don't think it's a problem. Although, I would reconsider setting the default seed, because that way two training for the same parameters yields the same result, and that might not be what do user expects. The user might expect truly randomized result.
    
    Do you think we should modify the default optional seed `Some(0L)` to be `None`, and use the system random generator by default? Again, this is in the original algorithm. Should we do it in this PR or create another issue for refactoring ALS?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80862018
  
    --- Diff: flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/recommendation/ImplicitALSTest.scala ---
    @@ -0,0 +1,171 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.ml.recommendation
    +
    +import org.apache.flink.ml.util.FlinkTestBase
    +import org.scalatest._
    +
    +import scala.language.postfixOps
    +import org.apache.flink.api.scala._
    +import org.apache.flink.core.testutils.CommonTestUtils
    +
    +class ImplicitALSTest
    +  extends FlatSpec
    +    with Matchers
    +    with FlinkTestBase {
    +
    +  override val parallelism = 2
    +
    +  behavior of "The modification of the alternating least squares (ALS) implementation" +
    +    "for implicit feedback datasets."
    +
    +  it should "properly compute Y^T * Y, and factorize matrix" in {
    --- End diff --
    
    Are you testing two functionalities in this test? If yes I suggest to split them to two functional units.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80860057
  
    --- Diff: docs/dev/libs/ml/als.md ---
    @@ -49,6 +49,21 @@ By applying this step alternately to the matrices $U$ and $V$, we can iterativel
     
     The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
     
    +An alternative model can be used for _implicit feedback_ datasets.
    +These datasets only contain implicit feedback from the user
    +in contrast to datasets with explicit feedback like movie ratings.
    +For example users watch videos on a website and the website monitors which user
    +viewed which video, so the users only provide their preference implicitly.
    +In these cases the feedback should not be treated as a
    +rating, but rather an evidence that the user prefers that item.
    +Thus, for implicit feedback datasets there is a slightly different
    +minimalization problem to solve (see [Hu et al.](http://dx.doi.org/10.1109/ICDM.2008.22) for details).
    --- End diff --
    
    Change "minimalization" to "optimization".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87354805
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -535,8 +581,17 @@ object ALS {
         itemOut: DataSet[(Int, OutBlockInformation)],
         userIn: DataSet[(Int, InBlockInformation)],
         factors: Int,
    -    lambda: Double, blockIDPartitioner: FlinkPartitioner[Int]):
    +    lambda: Double, blockIDPartitioner: FlinkPartitioner[Int],
    +    implicitPrefs: Boolean,
    +    alpha: Double):
       DataSet[(Int, Array[Array[Double]])] = {
    +    // retrieve broadcast XtX matrix in implicit case
    +    val XtXtoBroadcast = if (implicitPrefs) {
    --- End diff --
    
    I tried to fit the notation of the explicit ALS code, as it uses `userXtX` notation in the `updateFactors` function. I think it might be confusing to use two different notations is the code, even if the paper uses another notation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87201508
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -535,8 +581,17 @@ object ALS {
         itemOut: DataSet[(Int, OutBlockInformation)],
         userIn: DataSet[(Int, InBlockInformation)],
         factors: Int,
    -    lambda: Double, blockIDPartitioner: FlinkPartitioner[Int]):
    +    lambda: Double, blockIDPartitioner: FlinkPartitioner[Int],
    +    implicitPrefs: Boolean,
    +    alpha: Double):
       DataSet[(Int, Array[Array[Double]])] = {
    +    // retrieve broadcast XtX matrix in implicit case
    +    val XtXtoBroadcast = if (implicitPrefs) {
    --- End diff --
    
    I'm a bit confused with the notation here, is this matrix the `YtY ` matrix from the paper? If yes, I would recommend sticking to the notation of the paper to avoid confusion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87421513
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -675,7 +756,69 @@ object ALS {
               collector.collect((blockID, array))
             }
           }
    -    }.withForwardedFieldsFirst("0").withForwardedFieldsSecond("0")
    +    }
    +
    +    // broadcasting XtX matrix in the implicit case
    +    val updatedFactorMatrix = if (implicitPrefs) {
    +      newMatrix.withBroadcastSet(XtXtoBroadcast.get, "XtX")
    +    } else {
    +      newMatrix
    +    }
    +
    +    updatedFactorMatrix.withForwardedFieldsFirst("0").withForwardedFieldsSecond("0")
    +  }
    +
    +  /**
    +    * Computes the XtX matrix for the implicit version before updating the factors.
    +    * This matrix is intended to be broadcast, but as we cannot use a sink inside a Flink
    +    * iteration, so we represent it as a [[DataSet]] with a single element containing the matrix.
    +    *
    +    * The algorithm computes `X_i^T * X_i` for every block `X_i` of `X`,
    +    * then sums all these computed matrices to get `X^T * X`.
    +    */
    +  private[recommendation] def computeXtX(x: DataSet[(Int, Array[Array[Double]])], factors: Int):
    +  DataSet[Array[Double]] = {
    +    val triangleSize = factors * (factors - 1) / 2 + factors
    +
    +    type MtxBlock = (Int, Array[Array[Double]])
    +    // construct XtX for all blocks
    +    val xtx = x
    +      .mapPartition(new MapPartitionFunction[MtxBlock, Array[Double]]() {
    +        var xtxForBlock: Array[Double] = null
    +
    +        override def mapPartition(blocks: Iterable[(Int, Array[Array[Double]])],
    +                                  out: Collector[Array[Double]]): Unit = {
    +
    +          if (xtxForBlock == null) {
    +            // creating the matrix if not yet created
    +            xtxForBlock = Array.fill(triangleSize)(0.0)
    +          } else {
    +            // erasing the matrix
    +            var i = 0
    +            while (i < xtxForBlock.length) {
    --- End diff --
    
    I don't imagine this making a major difference in performance, so let's just go with the cleaner code angle and use `fill`.
    
    I wish we had an easy to use integrated way to do proper profiling so such decisions can be easier (i.e. if this is 0.5% of the CPU cost, then optimizing is pointless but right now we don't know)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87202604
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -675,7 +756,69 @@ object ALS {
               collector.collect((blockID, array))
             }
           }
    -    }.withForwardedFieldsFirst("0").withForwardedFieldsSecond("0")
    +    }
    +
    +    // broadcasting XtX matrix in the implicit case
    +    val updatedFactorMatrix = if (implicitPrefs) {
    +      newMatrix.withBroadcastSet(XtXtoBroadcast.get, "XtX")
    +    } else {
    +      newMatrix
    +    }
    +
    +    updatedFactorMatrix.withForwardedFieldsFirst("0").withForwardedFieldsSecond("0")
    +  }
    +
    +  /**
    +    * Computes the XtX matrix for the implicit version before updating the factors.
    +    * This matrix is intended to be broadcast, but as we cannot use a sink inside a Flink
    +    * iteration, so we represent it as a [[DataSet]] with a single element containing the matrix.
    +    *
    +    * The algorithm computes `X_i^T * X_i` for every block `X_i` of `X`,
    +    * then sums all these computed matrices to get `X^T * X`.
    +    */
    +  private[recommendation] def computeXtX(x: DataSet[(Int, Array[Array[Double]])], factors: Int):
    +  DataSet[Array[Double]] = {
    +    val triangleSize = factors * (factors - 1) / 2 + factors
    +
    +    type MtxBlock = (Int, Array[Array[Double]])
    +    // construct XtX for all blocks
    +    val xtx = x
    +      .mapPartition(new MapPartitionFunction[MtxBlock, Array[Double]]() {
    +        var xtxForBlock: Array[Double] = null
    +
    +        override def mapPartition(blocks: Iterable[(Int, Array[Array[Double]])],
    +                                  out: Collector[Array[Double]]): Unit = {
    +
    +          if (xtxForBlock == null) {
    +            // creating the matrix if not yet created
    +            xtxForBlock = Array.fill(triangleSize)(0.0)
    +          } else {
    +            // erasing the matrix
    +            var i = 0
    +            while (i < xtxForBlock.length) {
    --- End diff --
    
    Any reason why `fill` is not/cannot be used here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87353332
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -156,6 +171,26 @@ class ALS extends Predictor[ALS] {
         this
       }
     
    +  /** Sets the input observations to be implicit, thus using the iALS algorithm for learning.
    --- End diff --
    
    Good point. Changed the wording.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    @gaborhermann I think have a larger scale test would boost our confidence in the implementation, and maybe point out some problems that do not manifest with small data, which is very common.
    
    If you plan to do it anyway we might as well do it before merging the code, but it's not a blocker.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Okay, we're on it.
    @thvasilo thanks for the dataset suggestion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] Extend ALS to handle implicit feedback datas...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    We did not measure performance against Spark or other implementations yet. Those would reflect the performance of Flink ALS implementation, as there is not much difference between the implicit and explicit implementations.
    
    Instead, we compared the implicit case with the explicit case in the Flink implementation on the same datasets, to make sure the implicit case does not decrease the performance significantly. (Of course, we expected the implicit case to be slower due to the extra precomputation and broadcasting of `Xt * X`.)
    
    ```
            expl  impl
    100     8885   9196
    1000    7879  11282
    10000   8839   9220
    100000  7102  10998
    1000000 7543  10680
    ```
    
    The numbers in the left column indicate the size of the training set (I'm not sure about the measure, but @jfeher can tell about it). The numbers are the training time in milliseconds in the explicit and implicit case respectively. We did the measurements on a small cluster of 3 nodes.
    
    It seems, there is a large constant overhead, but it's not significantly slower in the implicit case.
    We could do further, more thorough measurements if needed, but maybe that would be another issue. Benchmarking more and optimizing both the original ALS algorithm and the specific `Xt * X` computation in the implicit case could be a separate PR.
    
    What are your thoughts on this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Thanks @jfeher for the measurements! :)
    
    @thvasilo The filtering referred to having distinct (user,artist) pairs. It's only because the input of the iALS is a sparse matrix, and it would not make much sense to have more than one values for the same element of the matrix. E.g. to aggregate multiple listenings for the same (user,artist) pair, one could count them, and use the count as the implicit rating. We simply used the value 1.0 for every user-artist pair, but the algorithm works with any (positive) values, not only binary interactions.
    
    We've only measured Flink against itself, as the main ALS algorithm is already in Flink. It would be interesting to measure against Spark and other solutions, but that might not reflect the performance of our iALS extension, but rather the performance of ALS itself. That seems to be another issue for me. Do I see this right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] Extend ALS to handle implicit feedbac...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r80861067
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -581,6 +637,16 @@ object ALS {
             val userXy = new ArrayBuffer[Array[Double]]()
             val numRatings = new ArrayBuffer[Int]()
     
    +        var precomputedXtX: Array[Double] = null
    +
    +        override def open(config: Configuration): Unit = {
    +          // retrieve broadcasted precomputed XtX if using implicit feedback
    +          if (implicitPrefs) {
    +            precomputedXtX = getRuntimeContext.getBroadcastVariable[Array[Double]]("XtX")
    +              .iterator().next()
    +          }
    +        }
    +
             override def coGroup(left: lang.Iterable[(Int, Int, Array[Array[Double]])],
    --- End diff --
    
    Is this a Java iterable here? Any reason to use this instead of the Scala `Iterable` trait?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r81115301
  
    --- Diff: flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/recommendation/ImplicitALSTest.scala ---
    @@ -0,0 +1,171 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.ml.recommendation
    +
    +import org.apache.flink.ml.util.FlinkTestBase
    +import org.scalatest._
    +
    +import scala.language.postfixOps
    +import org.apache.flink.api.scala._
    +import org.apache.flink.core.testutils.CommonTestUtils
    +
    +class ImplicitALSTest
    +  extends FlatSpec
    +    with Matchers
    +    with FlinkTestBase {
    +
    +  override val parallelism = 2
    +
    +  behavior of "The modification of the alternating least squares (ALS) implementation" +
    +    "for implicit feedback datasets."
    +
    +  it should "properly compute Y^T * Y, and factorize matrix" in {
    +    import ExampleMatrix._
    +
    +    val rand = scala.util.Random
    +    val numBlocks = 3
    +    // randomly split matrix to blocks
    +    val blocksY = Y
    +      // add a random block id to every row
    +      .map { row =>
    +        (rand.nextInt(numBlocks), row)
    +      }
    +      // get the block via grouping
    +      .groupBy(_._1).values
    +      // add a block id (-1) to each block
    +      .map(b => (-1, b.map(_._2)))
    +      .toSeq
    +
    +    // use Flink to compute YtY
    +    val env = ExecutionEnvironment.getExecutionEnvironment
    +
    +    val distribBlocksY = env.fromCollection(blocksY)
    +
    +    val YtY = ALS
    +      .computeXtX(distribBlocksY, factors)
    +      .collect().head
    +
    +    // check YtY size
    +    YtY.length should be (factors * (factors - 1) / 2 + factors)
    +
    +    // check result is as expected
    +    expectedUpperTriangleYtY
    +      .zip(YtY)
    +      .foreach { case (expected, result) =>
    +        result should be (expected +- 0.1)
    +      }
    +
    +    // temporary directory to avoid too few memory segments
    +    val tempDir = CommonTestUtils.getTempDir + "/"
    +
    +    // factorize matrix with implicit ALS
    +    val als = ALS()
    +      .setIterations(iterations)
    +      .setLambda(lambda)
    +      .setBlocks(blocks)
    +      .setNumFactors(factors)
    +      .setImplicit(true)
    +      .setAlpha(alpha)
    +      .setSeed(seed)
    +      .setTemporaryPath(tempDir)
    +
    +    val inputDS = env.fromCollection(implicitRatings)
    +
    +    als.fit(inputDS)
    +
    +    // check predictions on some user-item pairs
    +    val testData = env.fromCollection(expectedResult.map{
    +      case (userID, itemID, rating) => (userID, itemID)
    +    })
    +
    +    val predictions = als.predict(testData).collect()
    +
    +    predictions.length should equal(expectedResult.length)
    +
    +    val resultMap = expectedResult map {
    +      case (uID, iID, value) => (uID, iID) -> value
    +    } toMap
    +
    +    predictions foreach {
    +      case (uID, iID, value) => {
    +        resultMap.isDefinedAt((uID, iID)) should be(true)
    +
    +        value should be(resultMap((uID, iID)) +- 1e-5)
    +      }
    +    }
    +
    +  }
    +
    +}
    +
    +object ExampleMatrix {
    --- End diff --
    
    Thanks. Moved the data to `Recommendation.scala`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by jfeher <gi...@git.apache.org>.
Github user jfeher commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Hi, we have measured the training time of als and ials with the given dataset.
    After filtering the data to unique item user pairs we got approximatly 64 million rankings.
    
    We measured on a cluster with four nodes and on yarn. All of the nodes had 16 GB of memory. 
    The taskmanagers got 12 GB and the jobmanager got 2 GB.
    We had four taskmanagers, one four each node.
    After some testing it looked like a block number between 100 and 1500 is the most optimal.
    And between 100 and 300 the running times were steadily low.
    
    **For ials we got the following measurments:**
    
    The average time for block numbers between 100 and 1500 and 1 iteration in seconds: 2000.33s
    
    The average time for block numbers between 100 and 300 and 1 iteration in seconds: 1729.44s
    
    More detailed results by block sizes on the diagram: http://imgur.com/LjJavti
    
    **For als with the same configurations we got the following measurments:**
    
    The average time for block numbers between 100 and 1500 and 1 iteration in seconds: 1694.04s
    
    The average time for block numbers between 100 and 300 and 1 iteration in seconds: 1465.77s
    
    So the ials version was 300 s slower on this data than the als.
    
    When we increased the iteration number for 10 the time difference stayed under 1000 s which is less than ten times 300.
    This is because the fix time cost for the whole training is big.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2542: [FLINK-4613] [ml] Extend ALS to handle implicit feedback ...

Posted by gaborhermann <gi...@git.apache.org>.
Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Hi @thvasilo, thanks for reviewing my code.
    
    You are right, @jfeher told me that the numbers in the left column indicate the number of non-zero elements in the rating matrix. We are actually planning to do experiments on a larger data set, but not right now.
    Do you think it's important to do so before merging the PR? If so we'll do it ASAP.
    
    Thanks for the suggestion, I marked the PR with the [ml] tag.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r87199446
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -273,6 +308,14 @@ object ALS {
         val defaultValue: Option[Int] = Some(10)
       }
     
    +  case object ImplicitPrefs extends Parameter[Boolean] {
    --- End diff --
    
    Can't find a way to comment on line 264/299 but we should take the opportunity to set the default number of factors to a more reasonable 50, and add to the docstring and documentation the recommendation:
    
    > we recommend working with the highest number of factors feasible within computational limitations.
    
    Which comes straight from the iALS paper.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2542: [FLINK-4613] [ml] Extend ALS to handle implicit fe...

Posted by mbalassi <gi...@git.apache.org>.
Github user mbalassi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2542#discussion_r90032424
  
    --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala ---
    @@ -581,6 +637,16 @@ object ALS {
             val userXy = new ArrayBuffer[Array[Double]]()
             val numRatings = new ArrayBuffer[Int]()
     
    +        var precomputedXtX: Array[Double] = null
    +
    +        override def open(config: Configuration): Unit = {
    +          // retrieve broadcasted precomputed XtX if using implicit feedback
    +          if (implicitPrefs) {
    +            precomputedXtX = getRuntimeContext.getBroadcastVariable[Array[Double]]("XtX")
    +              .iterator().next()
    +          }
    +        }
    +
             override def coGroup(left: lang.Iterable[(Int, Int, Array[Array[Double]])],
    --- End diff --
    
    I agree with @gaborhermann here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---