Posted to issues@spark.apache.org by "Abou Haydar Elias (JIRA)" <ji...@apache.org> on 2015/07/16 15:03:04 UTC

[jira] [Commented] (SPARK-9098) Inconsistent Dense Vectors hashing between PySpark and Scala

    [ https://issues.apache.org/jira/browse/SPARK-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629685#comment-14629685 ] 

Abou Haydar Elias commented on SPARK-9098:
------------------------------------------

This issue creates an inconsistency in the API.

So I totally agree with [~zero323] on enforcing immutability and providing meaningful hashing. That would be a good approach.
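
For illustration, a value-based {{__hash__}} that stays consistent with {{__eq__}} could look like the sketch below. This uses a hypothetical {{FrozenVector}} class, not actual Spark code:

{code}
# A minimal sketch assuming immutable storage; hypothetical class,
# not the Spark implementation.
class FrozenVector(object):
    def __init__(self, values):
        # Copy into a tuple so the contents can never be mutated.
        self._values = tuple(float(v) for v in values)

    def __eq__(self, other):
        return isinstance(other, FrozenVector) and self._values == other._values

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Equal vectors hash equally, so groupByKey behaves as in Scala.
        return hash(self._values)
{code}

With something like this, two vectors built from the same values compare equal and hash equal, so {{groupByKey}} would see them as a single key.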

> Inconsistent Dense Vectors hashing between PySpark and Scala
> ------------------------------------------------------------
>
>                 Key: SPARK-9098
>                 URL: https://issues.apache.org/jira/browse/SPARK-9098
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>    Affects Versions: 1.3.1, 1.4.0
>            Reporter: Maciej Szymkiewicz
>            Priority: Minor
>
> When using Scala, it is possible to group an RDD using a {{DenseVector}} as the key:
> {code}
> import org.apache.spark.mllib.linalg.Vectors
> val rdd = sc.parallelize(
>     (Vectors.dense(1, 2, 3), 10) :: (Vectors.dense(1, 2, 3), 20) :: Nil)
> rdd.groupByKey.count
> {code}
> returns 1 as expected.
> In PySpark, {{DenseVector.__hash__}} seems to be inherited from {{object}} and based on the memory address:
> {code}
> from pyspark.mllib.linalg import DenseVector
> rdd = sc.parallelize(
>     [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)])
> rdd.groupByKey().count()
> {code}
> returns 2.
> Since the underlying {{numpy.ndarray}} can be used to mutate a {{DenseVector}}, hashing doesn't look meaningful at all:
> {code}
> >>> dv = DenseVector([1, 2, 3])
> >>> hdv1 = hash(dv)
> >>> dv.array[0] = 3.0
> >>> hdv2 = hash(dv)
> >>> hdv1 == hdv2
> True
> >>> dv == DenseVector([1, 2, 3])
> False
> {code}
> In my opinion, the best approach would be to enforce immutability and provide meaningful hashing. An alternative is to make {{DenseVector}} unhashable, the same as {{numpy.ndarray}}.
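> For reference, {{numpy.ndarray}} simply refuses to be hashed:
> {code}
> >>> import numpy as np
> >>> hash(np.array([1, 2, 3]))
> Traceback (most recent call last):
>   ...
> TypeError: unhashable type: 'numpy.ndarray'
> {code}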
> Source: http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org