Posted to reviews@spark.apache.org by rezazadeh <gi...@git.apache.org> on 2014/10/01 23:45:13 UTC

[GitHub] spark pull request: CosineSimilarity Example

GitHub user rezazadeh opened a pull request:

    https://github.com/apache/spark/pull/2622

    CosineSimilarity Example

    Provide an example for `RowMatrix.columnSimilarities()`
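
As a rough illustration of the quantity this example is about, the sketch below computes cosine similarity between two columns of a small dense matrix held in local arrays. It assumes the 3-by-2 matrix shown in the example's scaladoc. `CosineSketch` and `columnCosine` are hypothetical names, and nothing here is Spark API or the sampling-based algorithm.

```scala
// Local, brute-force sketch; not the distributed RowMatrix implementation.
object CosineSketch {
  /**
   * Cosine similarity between columns i and j of a row-major dense matrix.
   * A column of all zeros yields NaN because its norm is zero.
   */
  def columnCosine(rows: Array[Array[Double]], i: Int, j: Int): Double = {
    val dot = rows.map(r => r(i) * r(j)).sum
    val normI = math.sqrt(rows.map(r => r(i) * r(i)).sum)
    val normJ = math.sqrt(rows.map(r => r(j) * r(j)).sum)
    dot / (normI * normJ)
  }

  // The 3-by-2 matrix from the scaladoc: one inner array per row.
  val sample: Array[Array[Double]] = Array(
    Array(0.5, 1.0),
    Array(2.0, 3.0),
    Array(4.0, 5.0)
  )
}
```

Calling `CosineSketch.columnCosine(CosineSketch.sample, 0, 1)` compares the two columns of that sample matrix.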

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rezazadeh/spark dimsumexample

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2622.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2622
    
----
commit 453357938733ab3172d7b4c8f959f25de9cdbbc9
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-10-01T21:42:58Z

    CosineSimilarity Example

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58278861
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21410/




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18553438
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.{SparkConf, SparkContext}
    +import scopt.OptionParser
    +
    +/**
    + * Compute the similar columns of a matrix, using cosine similarity.
    + *
    + * The input matrix must be stored in row-oriented dense format, one line per row with its entries
    + * separated by space. For example,
    + * {{{
    + * 0.5 1.0
    + * 2.0 3.0
    + * 4.0 5.0
    + * }}}
    + * represents a 3-by-2 matrix, whose first row is (0.5, 1.0).
    + *
    + * Example invocation:
    + *
    + * bin/run-example org.apache.spark.examples.mllib.CosineSimilarity \
    --- End diff --
    
    `org.apache.spark.examples.mllib.CosineSimilarity` -> `mllib.CosineSimilarity` (we don't need `org.apache.spark.examples` with `run-example`)




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2622




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by rezazadeh <gi...@git.apache.org>.
Github user rezazadeh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18554296
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.{SparkConf, SparkContext}
    +import scopt.OptionParser
    +
    +/**
    + * Compute the similar columns of a matrix, using cosine similarity.
    + *
    + * The input matrix must be stored in row-oriented dense format, one line per row with its entries
    + * separated by space. For example,
    + * {{{
    + * 0.5 1.0
    + * 2.0 3.0
    + * 4.0 5.0
    + * }}}
    + * represents a 3-by-2 matrix, whose first row is (0.5, 1.0).
    + *
    + * Example invocation:
    + *
    + * bin/run-example org.apache.spark.examples.mllib.CosineSimilarity \
    + * --inputFile data/mllib/sample_svm_data.txt --threshold 0.1
    + */
    +object CosineSimilarity {
    +  case class Params(inputFile: String = null, threshold: Double = 0.1)
    +
    +  def main(args: Array[String]) {
    +    val defaultParams = Params()
    +
    +    val parser = new OptionParser[Params]("CosineSimilarity") {
    +      head("CosineSimilarity: an example app.")
    +      opt[String]("inputFile")
    --- End diff --
    
    Turn `input` into positional parameter




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by rezazadeh <gi...@git.apache.org>.
Github user rezazadeh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18554285
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.{SparkConf, SparkContext}
    +import scopt.OptionParser
    +
    +/**
    + * Compute the similar columns of a matrix, using cosine similarity.
    + *
    + * The input matrix must be stored in row-oriented dense format, one line per row with its entries
    + * separated by space. For example,
    + * {{{
    + * 0.5 1.0
    + * 2.0 3.0
    + * 4.0 5.0
    + * }}}
    + * represents a 3-by-2 matrix, whose first row is (0.5, 1.0).
    + *
    + * Example invocation:
    + *
    + * bin/run-example org.apache.spark.examples.mllib.CosineSimilarity \
    --- End diff --
    
    Changed to just mllib.CosineSimilarity




[GitHub] spark pull request: [MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-57550522
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21161/consoleFull) for   PR 2622 at commit [`4533579`](https://github.com/apache/spark/commit/453357938733ab3172d7b4c8f959f25de9cdbbc9).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58278857
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21410/consoleFull) for   PR 2622 at commit [`e573c7a`](https://github.com/apache/spark/commit/e573c7a8c6e9a5a23c7121ecac743f0d53849e11).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Params(inputFile: String = null, threshold: Double = 0.1)`





[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by rezazadeh <gi...@git.apache.org>.
Github user rezazadeh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18554272
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.{SparkConf, SparkContext}
    +import scopt.OptionParser
    --- End diff --
    
    Moved above spark imports and separated.




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18554792
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.{SparkConf, SparkContext}
    +import scopt.OptionParser
    +
    +/**
    + * Compute the similar columns of a matrix, using cosine similarity.
    + *
    + * The input matrix must be stored in row-oriented dense format, one line per row with its entries
    + * separated by space. For example,
    + * {{{
    + * 0.5 1.0
    + * 2.0 3.0
    + * 4.0 5.0
    + * }}}
    + * represents a 3-by-2 matrix, whose first row is (0.5, 1.0).
    + *
    + * Example invocation:
    + *
    + * bin/run-example org.apache.spark.examples.mllib.CosineSimilarity \
    + * --inputFile data/mllib/sample_svm_data.txt --threshold 0.1
    + */
    +object CosineSimilarity {
    +  case class Params(inputFile: String = null, threshold: Double = 0.1)
    +
    +  def main(args: Array[String]) {
    +    val defaultParams = Params()
    +
    +    val parser = new OptionParser[Params]("CosineSimilarity") {
    +      head("CosineSimilarity: an example app.")
    +      opt[String]("inputFile")
    +        .required()
    +        .text(s"input file, one row per line, space-separated")
    +        .action((x, c) => c.copy(inputFile = x))
    +      opt[Double]("threshold")
    +        .required()
    +        .text(s"threshold similarity: to tradeoff computation vs quality estimate")
    +        .action((x, c) => c.copy(threshold = x))
    +      note(
    +        """
    +          |For example, the following command runs this app on a dataset:
    +          |
    +          | ./bin/spark-submit  --class org.apache.spark.examples.mllib.CosineSimilarity \
    +          | examplesjar.jar \
    +          | --inputFile data/mllib/sample_svm_data.txt --threshold 0.1
    +        """.stripMargin)
    +    }
    +
    +    parser.parse(args, defaultParams).map { params =>
    +      run(params)
    +    } getOrElse {
    +      System.exit(1)
    +    }
    +  }
    +
    +  def run(params: Params) {
    +    val conf = new SparkConf().setAppName("CosineSimilarity")
    +    val sc = new SparkContext(conf)
    +
    +    // Load and parse the data file.
    +    val rows = sc.textFile(params.inputFile).map { line =>
    --- End diff --
    
    You're right.




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58275473
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21414/




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18553440
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.{SparkConf, SparkContext}
    +import scopt.OptionParser
    +
    +/**
    + * Compute the similar columns of a matrix, using cosine similarity.
    + *
    + * The input matrix must be stored in row-oriented dense format, one line per row with its entries
    + * separated by space. For example,
    + * {{{
    + * 0.5 1.0
    + * 2.0 3.0
    + * 4.0 5.0
    + * }}}
    + * represents a 3-by-2 matrix, whose first row is (0.5, 1.0).
    + *
    + * Example invocation:
    + *
    + * bin/run-example org.apache.spark.examples.mllib.CosineSimilarity \
    + * --inputFile data/mllib/sample_svm_data.txt --threshold 0.1
    + */
    +object CosineSimilarity {
    +  case class Params(inputFile: String = null, threshold: Double = 0.1)
    +
    +  def main(args: Array[String]) {
    +    val defaultParams = Params()
    +
    +    val parser = new OptionParser[Params]("CosineSimilarity") {
    +      head("CosineSimilarity: an example app.")
    +      opt[String]("inputFile")
    --- End diff --
    
    `input` should be a positional parameter. (See BinaryClassification.scala for example)
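
A minimal sketch of what a positional input argument could look like with scopt, assuming the scopt 3.x `OptionParser` API and that scopt is on the classpath. The `<inputFile>` placeholder and the `PositionalArgSketch` wrapper are illustrative choices; the code that was eventually merged may differ.

```scala
import scopt.OptionParser

object PositionalArgSketch {
  case class Params(inputFile: String = null, threshold: Double = 0.1)

  private val parser = new OptionParser[Params]("CosineSimilarity") {
    head("CosineSimilarity: an example app.")
    // Named option, still passed as --threshold <value>.
    opt[Double]("threshold")
      .required()
      .text("threshold similarity: to tradeoff computation vs quality estimate")
      .action((x, c) => c.copy(threshold = x))
    // Positional argument: matched by position, with no --inputFile flag.
    arg[String]("<inputFile>")
      .required()
      .text("input file, one row per line, space-separated")
      .action((x, c) => c.copy(inputFile = x))
  }

  /** Returns Some(params) on success, None if scopt rejects the arguments. */
  def parse(args: Array[String]): Option[Params] =
    parser.parse(args.toSeq, Params())
}
```

With this shape, the input path is given bare on the command line, for example `--threshold 0.1 data/mllib/sample_svm_data.txt`.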




[GitHub] spark pull request: [MLlib] CosineSimilarity Example

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-57557761
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21161/




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58091578
  
    @rezazadeh Could you update the example using `scopt` to parse parameters? You can check other example code for its usage. We try to be consistent across example code. It should take `--threshold` as a required parameter. For the evaluation, we can do it distributively, for example:
    
    ~~~
    val MAE = exact.entries.map { case MatrixEntry(i, j, u) =>
      ((i, j), u)
    }.leftOuterJoin(
      approx.entries.map { case MatrixEntry(i, j, v) =>
        ((i, j), v)
      }
    ).values.map {
      case (u, Some(v)) =>
        math.abs(u - v)
      case (u, None) =>
        math.abs(u)
    }.mean()
    ~~~
    
    I use MAE here but I'm not sure which metric matches the theory.
    
    Btw, I created a JIRA to have a specialized version of exact similarity computation, which doesn't require sampling. Let me know if you are interested in working on it: https://issues.apache.org/jira/browse/SPARK-3820
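
The left-outer-join logic in the snippet above can be checked locally without a cluster. The sketch below mirrors it on plain Scala `Map`s, assuming each `(i, j)` key appears at most once per matrix. `MaeSketch` and `meanAbsoluteError` are hypothetical names, and this is not the Spark code path itself.

```scala
// Local stand-in for the distributed MAE computation; uses Maps instead of RDDs.
object MaeSketch {
  type Key = (Long, Long)

  /**
   * Mean absolute error over every entry of `exact`.
   * An entry missing from `approx` is treated as an approximate value of 0,
   * which matches the `case (u, None) => math.abs(u)` branch.
   */
  def meanAbsoluteError(exact: Map[Key, Double], approx: Map[Key, Double]): Double = {
    require(exact.nonEmpty, "exact must contain at least one entry")
    val errors = exact.toSeq.map { case (key, u) =>
      approx.get(key) match {
        case Some(v) => math.abs(u - v)
        case None => math.abs(u)
      }
    }
    errors.sum / errors.size
  }
}
```

Entries that appear only in `approx` are ignored, just as they would be dropped by a left outer join keyed on the exact entries.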




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58275225
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21415/consoleFull) for   PR 2622 at commit [`379066d`](https://github.com/apache/spark/commit/379066db5b1d212bed9cbd60a28c4fa2b5927937).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-57897801
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21287/




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18553436
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.{SparkConf, SparkContext}
    +import scopt.OptionParser
    --- End diff --
    
    separate this import from spark imports and move it before spark imports




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-57897798
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21287/consoleFull) for   PR 2622 at commit [`eca3dfd`](https://github.com/apache/spark/commit/eca3dfd62c1ce3643ef03b44f79c3e840b27a390).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58280905
  
    LGTM. Merged into master. Thanks!


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by rezazadeh <gi...@git.apache.org>.
Github user rezazadeh commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-57896484
  
    Parameters are now configurable.
    Added approximation error reporting.
    Added JIRA.


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18553677
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.{SparkConf, SparkContext}
    +import scopt.OptionParser
    +
    +/**
    + * Compute the similar columns of a matrix, using cosine similarity.
    + *
    + * The input matrix must be stored in row-oriented dense format, one line per row with its entries
    + * separated by space. For example,
    + * {{{
    + * 0.5 1.0
    + * 2.0 3.0
    + * 4.0 5.0
    + * }}}
    + * represents a 3-by-2 matrix, whose first row is (0.5, 1.0).
    + *
    + * Example invocation:
    + *
    + * bin/run-example org.apache.spark.examples.mllib.CosineSimilarity \
    + * --inputFile data/mllib/sample_svm_data.txt --threshold 0.1
    + */
    +object CosineSimilarity {
    +  case class Params(inputFile: String = null, threshold: Double = 0.1)
    +
    +  def main(args: Array[String]) {
    +    val defaultParams = Params()
    +
    +    val parser = new OptionParser[Params]("CosineSimilarity") {
    +      head("CosineSimilarity: an example app.")
    +      opt[String]("inputFile")
    +        .required()
    +        .text(s"input file, one row per line, space-separated")
    +        .action((x, c) => c.copy(inputFile = x))
    +      opt[Double]("threshold")
    +        .required()
    +        .text(s"threshold similarity: to tradeoff computation vs quality estimate")
    +        .action((x, c) => c.copy(threshold = x))
    +      note(
    +        """
    +          |For example, the following command runs this app on a dataset:
    +          |
    +          | ./bin/spark-submit  --class org.apache.spark.examples.mllib.CosineSimilarity \
    +          | examplesjar.jar \
    +          | --inputFile data/mllib/sample_svm_data.txt --threshold 0.1
    +        """.stripMargin)
    +    }
    +
    +    parser.parse(args, defaultParams).map { params =>
    +      run(params)
    +    } getOrElse {
    +      System.exit(1)
    +    }
    +  }
    +
    +  def run(params: Params) {
    +    val conf = new SparkConf().setAppName("CosineSimilarity")
    +    val sc = new SparkContext(conf)
    +
    +    // Load and parse the data file.
    +    val rows = sc.textFile(params.inputFile).map { line =>
    +      val values = line.split(' ').map(_.toDouble)
    +      Vectors.dense(values)
    +    }
    +    val mat = new RowMatrix(rows)
    +
    +    // Compute similar columns perfectly, with brute force.
    +    val exact = mat.columnSimilarities()
    +
    +    // Compute similar columns with estimation using DIMSUM
    +    val approx = mat.columnSimilarities(params.threshold)
    +
    +    val MAE = exact.entries.map { case MatrixEntry(i, j, u) =>
    --- End diff --
    
    This is my bad. Let's make this block easier to read.
    
    ~~~
    val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
    val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
    val MAE = exactEntries.leftOuterJoin(approxEntries).values.map {
      case (u, Some(v)) =>
        math.abs(u - v)
      case (u, None) =>
        math.abs(u)
    }.mean()
    ~~~
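
To see what the suggested block computes, here is a minimal sketch on plain Scala collections rather than Spark RDDs. The entries and their values are made up for illustration, and `MaeSketch` and `meanAbsoluteError` are hypothetical names. As in the `leftOuterJoin` version, every exact entry is kept, and a pair missing from the approximate result contributes its full magnitude `|u|` to the error.

```scala
object MaeSketch {
  // Hypothetical similarity entries keyed by (i, j); the values are made up.
  val exact: Map[(Long, Long), Double] =
    Map((0L, 1L) -> 0.75, (0L, 2L) -> 0.125, (1L, 2L) -> 0.5)

  // The approximate result may omit pairs; here (0, 2) is missing.
  val approx: Map[(Long, Long), Double] =
    Map((0L, 1L) -> 0.5, (1L, 2L) -> 0.25)

  // Mirrors the left outer join: iterate over the exact entries and look up
  // the approximate value, treating a missing value as 0.
  def meanAbsoluteError(
      exact: Map[(Long, Long), Double],
      approx: Map[(Long, Long), Double]): Double = {
    val errors = exact.toSeq.map { case (key, u) =>
      approx.get(key) match {
        case Some(v) => math.abs(u - v)
        case None => math.abs(u)
      }
    }
    // Note: yields NaN when `exact` is empty.
    errors.sum / errors.size
  }

  def main(args: Array[String]): Unit = {
    println(s"MAE = ${meanAbsoluteError(exact, approx)}")
  }
}
```

With these inputs the three absolute errors are 0.25, 0.125, and 0.25, so the mean is 0.625 / 3.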


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58283287
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21420/consoleFull) for   PR 2622 at commit [`8f20b82`](https://github.com/apache/spark/commit/8f20b825ddc3291982801e79517872cca61304b5).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Params(inputFile: String = null, threshold: Double = 0.1)`



[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58276749
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21420/consoleFull) for   PR 2622 at commit [`8f20b82`](https://github.com/apache/spark/commit/8f20b825ddc3291982801e79517872cca61304b5).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58283292
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21420/
    Test PASSed.


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by rezazadeh <gi...@git.apache.org>.
Github user rezazadeh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18554322
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +  def run(params: Params) {
    +    val conf = new SparkConf().setAppName("CosineSimilarity")
    +    val sc = new SparkContext(conf)
    +
    +    // Load and parse the data file.
    +    val rows = sc.textFile(params.inputFile).map { line =>
    --- End diff --
    
    No, that doesn't make sense: the input is an arbitrary matrix, whereas LIBSVM-format data comes with labels. Cached `rows`.
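
To make the format difference concrete, here is a minimal sketch in plain Scala (no Spark) that parses one line of each kind. It assumes the usual LIBSVM line layout, `label index:value ...` with one-based indices. The helper names `parseDense` and `parseLibSvm` are hypothetical and not part of MLlib.

```scala
object FormatSketch {
  // Dense format used by the example: one matrix row per line, space-separated.
  def parseDense(line: String): Array[Double] =
    line.trim.split(' ').map(_.toDouble)

  // LIBSVM-style line: a leading label, then index:value pairs.
  // Returns the label and (zero-based index, value) pairs.
  def parseLibSvm(line: String): (Double, Seq[(Int, Double)]) = {
    val tokens = line.trim.split(' ')
    val label = tokens.head.toDouble
    val features = tokens.tail.toSeq.map { token =>
      val Array(index, value) = token.split(':')
      (index.toInt - 1, value.toDouble)
    }
    (label, features)
  }

  def main(args: Array[String]): Unit = {
    println(parseDense("0.5 1.0").mkString(", "))
    println(parseLibSvm("1 1:0.5 2:1.0"))
  }
}
```

The dense line carries only matrix entries, while the LIBSVM line carries an extra label that a plain matrix has no use for.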


[GitHub] spark pull request: [MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-57557756
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21161/consoleFull) for   PR 2622 at commit [`4533579`](https://github.com/apache/spark/commit/453357938733ab3172d7b4c8f959f25de9cdbbc9).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `          println(s"Failed to load main class $childMainClass.")`



[GitHub] spark pull request: CosineSimilarity Example

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-57546461
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21160/


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by rezazadeh <gi...@git.apache.org>.
Github user rezazadeh commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58271836
  
    @mengxr 1) Started using scopt, and 2) Distributed the error computation per your suggestion.


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58271994
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21410/consoleFull) for   PR 2622 at commit [`e573c7a`](https://github.com/apache/spark/commit/e573c7a8c6e9a5a23c7121ecac743f0d53849e11).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18553443
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +  def run(params: Params) {
    +    val conf = new SparkConf().setAppName("CosineSimilarity")
    +    val sc = new SparkContext(conf)
    +
    +    // Load and parse the data file.
    +    val rows = sc.textFile(params.inputFile).map { line =>
    --- End diff --
    
    Shall we use LIBSVM format? Please cache `rows`.
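
For context on the caching request: in the quoted example, `rows` backs the `RowMatrix` used by both `columnSimilarities` calls, so without caching the input would generally be re-read and re-parsed for each pass. A minimal sketch, assuming Spark with MLlib is on the classpath; the in-memory stand-in for the input file, the `local[2]` master, and the object name `CachedRowsSketch` are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object CachedRowsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CachedRowsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Stand-in for sc.textFile(params.inputFile): the 3-by-2 matrix from the scaladoc.
    val lines = sc.parallelize(Seq("0.5 1.0", "2.0 3.0", "4.0 5.0"))

    // Parse once and mark the parsed rows for in-memory reuse.
    val rows = lines.map { line =>
      Vectors.dense(line.split(' ').map(_.toDouble))
    }.cache()
    val mat = new RowMatrix(rows)

    // Both computations below read from the cached rows.
    val exact = mat.columnSimilarities()
    val approx = mat.columnSimilarities(0.1)

    println(s"exact entries: ${exact.entries.count()}, " +
      s"approx entries: ${approx.entries.count()}")
    sc.stop()
  }
}
```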


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-57896544
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21287/consoleFull) for   PR 2622 at commit [`eca3dfd`](https://github.com/apache/spark/commit/eca3dfd62c1ce3643ef03b44f79c3e840b27a390).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by rezazadeh <gi...@git.apache.org>.
Github user rezazadeh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2622#discussion_r18554333
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala ---
    @@ -0,0 +1,109 @@
    +  def run(params: Params) {
    +    val conf = new SparkConf().setAppName("CosineSimilarity")
    +    val sc = new SparkContext(conf)
    +
    +    // Load and parse the data file.
    +    val rows = sc.textFile(params.inputFile).map { line =>
    +      val values = line.split(' ').map(_.toDouble)
    +      Vectors.dense(values)
    +    }
    +    val mat = new RowMatrix(rows)
    +
    +    // Compute similar columns perfectly, with brute force.
    +    val exact = mat.columnSimilarities()
    +
    +    // Compute similar columns with estimation using DIMSUM
    +    val approx = mat.columnSimilarities(params.threshold)
    +
    +    val MAE = exact.entries.map { case MatrixEntry(i, j, u) =>
    --- End diff --
    
    Changed to your suggestion, thanks.


[GitHub] spark pull request: CosineSimilarity Example

Posted by rezazadeh <gi...@git.apache.org>.
Github user rezazadeh commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-57549440
  
    Jenkins, test this please.


[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58281760
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21415/consoleFull) for   PR 2622 at commit [`379066d`](https://github.com/apache/spark/commit/379066db5b1d212bed9cbd60a28c4fa2b5927937).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Params(inputFile: String = null, threshold: Double = 0.1)`
      * `class RandomForestModel(val trees: Array[DecisionTreeModel], val algo: Algo) extends Serializable `
      * `class PStatsParam(AccumulatorParam):`



[GitHub] spark pull request: [SPARK-3790][MLlib] CosineSimilarity Example

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2622#issuecomment-58281770
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21415/
    Test PASSed.

