You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yu-iskw <gi...@git.apache.org> on 2014/08/15 09:49:52 UTC

[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

GitHub user yu-iskw opened a pull request:

    https://github.com/apache/spark/pull/1964

    [SPARK-3012] Standardized Distance Functions between two Vectors for MLlib

    https://issues.apache.org/jira/browse/SPARK-3012
    
    I implemented some distance measures between two Vector instances, such as Manhattan distance and Euclidean distance.Because the standardized distance functions help us to implement more machine learning algorithms efficiently

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yu-iskw/spark branch-1.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1964.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1964
    
----
commit 0e81b1239fe82d55feb9dca0ca7441d2e61aea14
Author: Yuu ISHIKAWA <yu...@gmail.com>
Date:   2014-08-15T07:35:52Z

    [SPARK-3012] Standardized Distance Functions between two Vectors for MLlib
    
    https://issues.apache.org/jira/browse/SPARK-3012

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53581400
  
    It seems most elegant and scala-like to directly supply the distance function as the parameter, instead of just a 'type' that selects some internal function.  If you don't like one of the pre-defined distance functions, write your own and use it as the parameter.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53547074
  
    @yu-iskw Sorry for the delay in code review! What do you expect users to do with the distances?
    
    For example, users can pick different distance measures in k-means. In that case, we should hide the distance implementation from users, and let users specify the distance type by its string name. So we can easily extend it to PySpark.
    
    Another use case is to let users compute various distance measures with MLlib's vectors. We try to keep MLlib's linear algebra implementation lightweight, given the fact that there are many linear algebra libraries, e.g., Breeze. In this case, it may be useful to contribute a `dist(v1, v2, type)` operator to Breeze and then call it in MLlib's algorithms. @dlwh


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52537821
  
    QA results for PR 1964:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>class ChebyshevDistanceMeasure extends DistanceMeasure {<br>class CosineDistanceMeasure extends DistanceMeasure {<br>trait DistanceMeasure extends Serializable {<br>class EuclideanDistanceMeasure extends SquaredEuclideanDistanceMeasure {<br>class ManhattanDistanceMeasure extends DistanceMeasure {<br>class MinkowskiDistanceMeasure(val exponent: Double) extends DistanceMeasure{<br>class SquaredEuclideanDistanceMeasure extends DistanceMeasure {<br>class TanimotoDistanceMeasure extends DistanceMeasure {<br>abstract class WeightedDistanceMeasure(weight: Vector) extends DistanceMeasure {<br>class WeightedEuclideanDistanceMeasure(weight: Vector) extends WeightedDistanceMeasure(weight) {<br>class WeightedManhattanDistanceMeasure(val weight: Vector) extends WeightedDistanceMeasure(weight) {<br><br>For more information see test ouptu
 t:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18761/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54960119
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20038/consoleFull) for   PR 1964 at commit [`6723187`](https://github.com/apache/spark/commit/672318743e45e307983c4b33b03b562735e9a4bb).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52756243
  
    Hi @erikerlandson, 
    
    Thank you for your comment.
    I modify the code according to your advice.
    Would you please check it.
    
    - DistanceMeasure trait extends Function2[Vector, Vector, Double]
    - Add implicit method for DistanceMeasure
    - Add DistanceMetric trait for not distance function
    -- ex) CosineDistanceMeasure, TanimotoDistanceMeasure, SquaredEuclideanMeasure are subclass it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54806883
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19976/consoleFull) for   PR 1964 at commit [`a44c30c`](https://github.com/apache/spark/commit/a44c30c2b8a549b04c57bb3cda3c46cb80495590).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-59885635
  
    Because this patch is not fit for the Spark design concept, I close this PR without merging.
    (http://apache-spark-developers-list.1001551.n3.nabble.com/Standardized-Distance-Functions-in-MLlib-td8697.html)
    Thank you very much for your cooperation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by rnowling <gi...@git.apache.org>.
Github user rnowling commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53010747
  
    Overall, changes look great!  Thank you for considering our feedback.
    
    I like that you were able to reduce the number of files.  I think SquaredEuclideanDistanceMeasure.scala probably needs to deleted -- it looks empty.  
    
    I'm wondering if the tests should be compacted as well?  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53129462
  
    Good Morning @erikerlandson, 
    
    Thank you for telling me the documentation.
    But I can not compile, if I changed the code in ChebyshevDistanceMetric.
    I guess that breeze.numerics doesn't have abs() for a vector.
    Although I read the source code of Breeze, I couldn't find.
    
    ## Sample Change
    ```
    before: val diff = (v1 - v2).map(Math.abs)
    after: val diff = abs(v1 - v2)
    ```
    ## Error Message
    ```
    Error:(99, 19) ambiguous implicit values:
     both method mapUFuncImpl in object Vector of type [Tag, V, U](implicit impl: breeze.generic.UFunc.UImpl[Tag,V,U], implicit canMapValues: breeze.linalg.support.CanMapValues[breeze.linalg.Vector[V],V,U,breeze.linalg.Vector[U]])breeze.generic.UFunc.UImpl[Tag,breeze.linalg.Vector[V],breeze.linalg.Vector[U]]
     and method fromLowOrderCanMapValues in trait MappingUFunc of type [T, V, V2, U](implicit handhold: breeze.linalg.support.CanMapValues.HandHold[T,V], implicit impl: breeze.numerics.package.abs.Impl[V,V2], implicit canMapValues: breeze.linalg.support.CanMapValues[T,V,V2,U])breeze.numerics.package.abs.Impl[T,U]
     match expected type breeze.numerics.abs.Impl[breeze.linalg.Vector[Double],VR]
        val diff = abs(v1 - v2)
                      ^
    ```
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53003052
  
    Given how small these classes are, it seems like it would be efficient to define them all in a single file


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53149119
  
    BYW, I checked the performance of Math.abs() and breeze.numerics.abs.
    It seems that Math.abs() performs better than breeze.numerics.abs.
    A performs better than B.
    https://gist.github.com/yu-iskw/518a48a68ef368998058


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52282833
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53347789
  
    Hi @erikerlandson, 
    
    Thank you for your comment.
    I use breeze.numerics.abs instead of Math.abs.
    I tried to check the performance of both again. That of breeze.vector.abs is better than Math.abs.
    https://gist.github.com/yu-iskw/aa56aad7481c4b45192c
    
    Mean of 100 times(breeze.numerics.abs): 964.64
    Mean of 100 times(breeze.Math.abs): 1029.46
    
    And, let me clarify my understanding.
    As commented previously, I can't compile my code, using abs(vector) in breeze.numerics.
    Does Breeze library  have abs() method for a vector?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52524657
  
    It might also be handy to define an implicit conversion from straight functions:
    
    implicit def functionToDistanceMeasure(f: (Vector, Vector)=>Double) = new DistanceMeasure {...}



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52554072
  
    It might be useful to distinguish true metrics from 'measures'.   For example, cosine distance is not a true distance metric.  In some algorithms, that difference can matter.   Maybe define a DistanceMetric trait that is a subclass of the DistanceMeasure trait, so spark functions can specify Metric if a true metric is desired.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53006322
  
    Hello @erikerlandson,
    
    Thank you for your comment.
    Its' in the morning in Japan :)
    
    > I don't think a Weighted trait improves the design.
    I agree with you. I removed Weighted trait.
    
    > Given how small these classes are, it seems like it would be efficient to define them all in a single file
    I moved the definitions of classes to DistanceMeassure or DistanceMetric.
    
    Sorry to trouble you.Would you please review it.
    Thanks
    
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53012573
  
    > one brief comment: with breeze vectors you can say abs(v) instead of v.map(Math.abs(_))
    
    Sorry, I can't find abs API for a vector in Breeze API documentation.
    http://www.scalanlp.org/api/breeze/#breeze.numerics.package$$abs$
    Manhattan distance requres max values for each abs() element in a vector. So I think that we shouldn't use import breeze.numerics.abs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53525367
  
    @mengxr, Will you please review the detail on this pull request.
    Thank you for your trouble in advance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by dlwh <gi...@git.apache.org>.
Github user dlwh commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53598635
  
    We'd happily take a distance ufunc, or perhaps a collection of distance
    ufuncs along the lines of what's here. I'd want it to be more Breeze-like
    than what's in the PR right now.
    
    I agree with Erik that the right thing to do is to allow people to pass in
    their own distance function.
    
    -- David
    
    
    On Wed, Aug 27, 2014 at 7:35 AM, Erik Erlandson <no...@github.com>
    wrote:
    
    > It seems most elegant and scala-like to directly supply the distance
    > function as the parameter, instead of just a 'type' that selects some
    > internal function. If you don't like one of the pre-defined distance
    > functions, write your own and use it as the parameter.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/1964#issuecomment-53581400>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53445323
  
    @yu-iskw, I'm in favor of adopting a standardized distance metric class.    How best to proceed is a question of architecture and road map.   I'm interested in @mengxr 's opinions on where to take it from here, as it would have a nontrivial impact on MLLib interfaces.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54952704
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20038/consoleFull) for   PR 1964 at commit [`6723187`](https://github.com/apache/spark/commit/672318743e45e307983c4b33b03b562735e9a4bb).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54564146
  
    I'm sorry for delay to reply in replying for you.
    Because I didn't concern about Python API, I rethink of the design for distance now.
    Please give me a few days.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52528032
  
    QA tests have started for PR 1964. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18761/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53845417
  
    @mengxr, @dlwh, @erikerlandson, @rnowling, 
    
    Thank you for your feedback.
    I agree with that idea which distance metrics/measures are implemented in Breeze. However, I am  a bit worried about the interface of the distance metrics in Spark.
    
    I could implement sample distance metric for Breeze like below. However, I couldn't thought of the interface for Spark.
    https://gist.github.com/yu-iskw/37ae208c530f7018e048
    
    I expect users can switch distance metric and use their own distance function in some machine learning algorithms in MLlib, such as Kmeans. Because some of them are applied to `RDD[Vector]` as its input data.  I think that users can make their own function metric, using Spark Vector instead of Breeze Vector.  For example, `(v1: Vector, v2: Vector) => Double`. At least, the interface shouldn't depend on Breeze.
    
    ## High Level Use Case Image
    
    ```
    KMeans().setMeasure(EuclideanDistance).run(RDD[Vector])
    KMeans().setMeasure((v1: Vector, v2: Vector) => fun(v1, v2)).run(RDD[Vector])
    ```
    
    I'm sorry if I misunderstood your opinion. If you have any good idea, can you please show me mock code for use case?
    
    (Please be generous in finding my rude expressions,if any. I don't mean such, because I'm not good at English.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53105970
  
    Hi @yu-iskw, 
    Universal element-wise functions (U-Funcs), including element-wise abs(v), are listed here:
    https://github.com/scalanlp/breeze/wiki/Universal-Functions



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by rnowling <gi...@git.apache.org>.
Github user rnowling commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53570185
  
    @mengxr , @yu-iskw 
    
    I think it is valuable to contribute distance metrics to Breeze, but not all of the metrics provided by @yu-iskw may be of interest to Breeze.  If MLLib provides its own wrapper, we can call Breeze for what distance metrics are available there and provide our own implementations for others.
    
    There was interest on the mailing list in different distance metrics for KMeans.  I think this PR should be amenable towards a solution for that.  My main complaint is that the distance metrics implementedin this PR expect MLlib Vectors, not Breeze vectors.  Before this is committed, I think we should figure out how to generalize these metrics to Breeze vectors -- maybe add distance(breeze, breeze) functions to @yu-iskw  's implementation or make breeze vectors the default type and provide an implicit way to cast MLlib vectors to Breeze vectors?
    
    Once native support for Breeze vectors is available, we can start work on a high-level API to distance metrics for KMeans and provide an implementation using the code in this PR.  A string-based API may be one option but this would not support distance metrics (e.g., weighted, L-n norms) which require additional parameters.
    
    What do you think?
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53348770
  
    One thing I noticed was that this will not compile:
    
        val v = new DenseBreezeVector((1 to 10).map(j => 1.0))
        val a = abs(v) // fails
    
    However this will:
    
        val v = new DenseBreezeVector((1 to 10).toArray.map(j => 1.0))
        val a = abs(v) // works
    
    So, constructing a breeze vector from an Array is important, as opposed to just a Seq.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53008478
  
    @yu-iskw  I took a fast look and that seems good to me.   I will try to review it more tomorrow.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52996580
  
    I think the design can be made somewhat less complex.   I coded up an example here:
    https://gist.github.com/erikerlandson/f4b9b9a5c9469f2d9006
    
    A couple features to note:
    * you can overload the `apply` method for both breeze vectors and scala vectors.  I defaulted the direct implementation to breeze, on the theory that this is where mllib is headed internally, but the overloadings for scala vectors can also be directly coded if there is a need
    * I don't think a Weighted trait improves the design.  If a class wants weights, it can take them as a constructor parameter and then use them.
    * I think there is no need for error checking on vector sizes, since breeze will do the same checking.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53008699
  
    one brief comment: with breeze vectors you can say `abs(v)` instead of `v.map(Math.abs(_))`
    
    import breeze.numerics.abs


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54953348
  
    @mengxr, @erikerlandson, @rnowling  Sorry for having been kept waiting. I redesigned the interface of distance functions. Here is the full file change.
    https://github.com/apache/spark/pull/1964/files
    
    I agree to keep MLlib's linear algebra lightweight. And we would had better to use Breeze in terms of the performance. However, I think the standardized distance function API helps us to expand MLlib efficiently.  And also, by contributing another distance functions to Breeze, we can use them in Spark.
    
    I am a bit worried about using Breeze directly in Spark. One reason is that we can't absolutely control the release of Breeze. If we want to use a new distance function in Spark, we contribute it to Breeze as I did. Because there is a time difference between being merged and release. We have to wait. That's why we should have a wrapper API for Breeze linear algebra. And the wrapper API helps us to extend another distance function easily.
    
    I think we need two the standardized distance function API. One is for Spark user, the other is for MLlib developer. `(Vector, Vector) => Double` is for Spark user.  `(BV[Double], BV[Double]) => Double ` is for MLlib developer. And Breeze should be hidden for Spark user.  At least, we should have a mapping function from `String` to a distance function because of PySpark API.
    
    We have to use the implementation in a wrapper API as much as we can. That is, you know, I contributed some distance functions to Breeze. However,they have not  been released on Maven repository. That's why we can't use them in Spark yet. If they will be released, I will modify the implementation, such as the  `apply` method of `EuclideanDistanceMetric` class.
    
    ## Diff Summary
    
    - `(BV[Double], BV[Double]) => Double` is for Spark user, such as `EuclideanDistanceMetric` class
    - Companion Object of the distance Function is for MLlib developer, such as `EuclideanDistanceMetric` object
    - Extract WeightedDistanceMetric and its sub class from DistanceMetric
    - Extract WeightedDistanceMeasure and its sub class from DistanceMeasure
    - Add a mapping function in `mllib/src/main/scala/org/apache/spark/mllib/linalg/distance/DistanceMeasureFactory.scala`
    
    thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52848633
  
    I like where this RFE is going.
    
    One comment about "metric" versus "measure" - cosine distance is a subclass of DistanceMeasure, as it is *not* a metric.  Others should be subclasses of DistanceMetric, such as manhattan, L2, etc.  Others of them I'm not sure of, but recommend to verify which ones are true metrics, and which are just measures.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52527707
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53936014
  
    @yu-iskw, your use case above with `setMeasure` is definitely how I envisioned it working, in terms of passing in metric functions as parameters.  
    
    I'm somewhat agnostic about exact mechanics, such as whether or not there is a dedicated setMeasure() method, or some other mechanism for passing in the function.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53351219
  
    Hi @erikerlandson,
    
    I appriciate your example code.
    I only use abs(v.toArray) in ChebyshevDistanceMetric.
    But i didn't change in WeightedChebyshevDistanceMetric.
    Because it can not modify in terms of negative weight(s).
     
    ```
    - val diff = (v1 - v2).map(elm => Math.abs(elm)).:*(weights)
    
    + val diff = abs((v1 - v2).toArray) .:*(weights)  // can not compile
    
    + val diff = abs(weights.:*(v1 - v2)) // fault result if weights has negative value
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52858495
  
    I feel ambiguous about whether there needs to be a sub-trait for WeightedDistanceMeasure, but if we have that then we will also need a WeightedDistanceMetric as a subtrait of DistanceMetric.
    
    or as an alternative just define a trait Weighted, and then:
    
    class WeightedEuclideanDistanceMetric(val weight: Vector) extends DistsanceMetric with Weighted



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw closed the pull request at:

    https://github.com/apache/spark/pull/1964


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52756284
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-55825405
  
    Hi @mengxr, @erikerlandson, @rnowling
    Sorry to comment again. Could you review it?
    thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54802662
  
    Jenkins, ok to test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52943537
  
    Hi @erikerlandson and @rnowling,
    
    Thank you for your advice.
    I modified my code. Would you please review it.
    
    - Improve the performance of calculating distance metric to use Breeze
    - Add Weighted trait for general weighted distance metric
    ex) class WeightedEuclideanDistanceMetric(val weight: Vector) extends DistsanceMetric with Weighted
    
    I think it's a bit tricky to calculate distance metric because of Weighted trait.
    I'd like to get your thoughts on it.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53378581
  
    If  the interface of distance function is valid, merging this issue, and then trying the performance improvement in another issue would be a good idea.
    Since the interface is merged, we can tackle  many algorithms, such as k-means or fuzzy c-means based on the interface and the current implementation, which of course can be replaced.
    I'd like to get your thoughts on it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54814790
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19976/consoleFull) for   PR 1964 at commit [`a44c30c`](https://github.com/apache/spark/commit/a44c30c2b8a549b04c57bb3cda3c46cb80495590).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53567544
  
    @mengxr  Thank you for your comment.
    
    I expect users are able to pick different distance measure in clustering algorithm, such as k-means. 
    And I agree with you in your view. We should hide the distance function implementation from users.
    But users should easily specify the distance type in clustering algorithm arguments.
    
    Would you please point out any mistakes to me in my code.
    If there is any problems in terms of the interface of distance function, I would like to merge it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54694468
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54254431
  
    @yu-iskw @erikerlandson @dlwh I prefer simple types for parameters for model serialization and consistent APIs across languages. In a predictive model, we should store the training parameters that used to train this model, and it would be nice to use simple-typed parameters. Another concern is Python API. If we pass in a distance implementation, we also need to define its Python counterpart for API consistency, which is not needed by PySpark's k-means because it calls Scala's implementation through serialization.
    
    For Spark's k-means, it should be good enough to support common and predefined distance measures, via Breeze.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53108224
  
    the implicit conversion function doesn't have to throw an exception, it just needs to override the breeze vector overloading and inherit the default spark vector overloading as usual:
    
        implicit def functionToDistanceMeasure(f: (BV[Double], BV[Double]) => Double): DistanceMeasure = new DistanceMeasure {
             override def apply(v1: BV[Double], v2: BV[Double]): Double = f(v1, v2)
        }
    
    I guess there also needs to be a `functionToDistanceMetric`
    
    Aside from that I think it is looking good.  Thank you for your work on this!
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54255195
  
    @yu-iskw Thanks for contributing to Breeze! I linked this JIRA to SPARK-3219, which tries to generalize k-means to support more distance measures, and we will try to use your implementation there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53013311
  
    @rnowling
    
    Thank you for your comment.
    
    - Remove SquaredEuclideanDistanceMeasure.scala
    - Aggregate tests to reduce the number of files


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52375350
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52857232
  
    Looking at the conversion code between spark vectors and breeze vectors, conversions seem reasonably inexpensive, as they are just exchanging their underlying data members.   It might be nice to migrate to using all breeze vectors instead of a mixture of spark and breeze, although that would be outside the scope of this issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52994477
  
    Sorry for immediate modification.
    
    - CosineDistanceMeasure and TanimotoDistanceMeasure are reverted before 64f7389
    -- Because both classes can not be applied Weighted trait
    - Rename TanimotoDistanceMetric class to TanimotoDistanceMeasure
    -- Tanimoto distance is not a metric


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by rnowling <gi...@git.apache.org>.
Github user rnowling commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52924889
  
    @erikerlandson Breeze uses non-JVM data structures so it can use BLAS. The malloc and copying could be both expensive and lead to 2-3x higher memory usage.
    
    This may be an artifact of the MLlib vector API and may be outside the scope of this issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-53341467
  
    You want to use abs(vector) instead of vector.map(abs(_)):
    https://gist.github.com/erikerlandson/488275019e4b54f5cdaf
    
    With the test harness above, I am getting breeze abs results around 2x faster, although the variance between test runs can be substantial.   I'm interested if you get comparable results running it on your environment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by erikerlandson <gi...@git.apache.org>.
Github user erikerlandson commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52516267
  
    I'm wondering if it might be simpler and more idiomatic to just define distance measure directly as any subclass of Function2, like:
    
    trait DistanceMeasure extends Function2[Vector, Vector, Double] with Serializable
    
    Then you could also just use "distanceObject(v1,v2)" syntax to get distance between v1 and v2
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

Posted by rnowling <gi...@git.apache.org>.
Github user rnowling commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-52810571
  
    Great work!
    
    If a class implements the DistanceMetric trait, it should probably be renamed ---DistanceMetric.  For example, CosineDistanceMeasure should be CosineDistanceMetric.
    
    From a performance standpoint, I'm concerned about the vector conversions.  KMeans converts all vectors to BreezeVectors internally, so using these metrics would require BreezeVectors -> Vectors -> BreezeVectors.  CosineDistanceMeasure actually converts the same vector (v1) twice instead of saving the conversion for reuse!
    
    Since the Breeze library should perform vector operations in a more efficient manner, Breeze should be used in place of map where possible.  E.g., WeightedEuclideanDistanceMeasure. Was map used there because Breeze doesn't support element-wise operations?
    
    I look forward to seeing this included in mllib!
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org