Posted to issues@spark.apache.org by "Maciej Szymkiewicz (JIRA)" <ji...@apache.org> on 2015/07/16 14:19:05 UTC
[jira] [Created] (SPARK-9098) Inconsistent Dense Vectors hashing between PySpark and Scala
Maciej Szymkiewicz created SPARK-9098:
-----------------------------------------
Summary: Inconsistent Dense Vectors hashing between PySpark and Scala
Key: SPARK-9098
URL: https://issues.apache.org/jira/browse/SPARK-9098
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark
Affects Versions: 1.4.0, 1.3.1
Reporter: Maciej Szymkiewicz
Priority: Minor
When using Scala, it is possible to group an RDD using a {{DenseVector}} as a key:
{code}
import org.apache.spark.mllib.linalg.Vectors
val rdd = sc.parallelize(
  (Vectors.dense(1, 2, 3), 10) :: (Vectors.dense(1, 2, 3), 20) :: Nil)
rdd.groupByKey.count
{code}
returns 1 as expected.
In PySpark, {{DenseVector.__hash__}} seems to be inherited from {{object}} and is therefore based on the memory address:
{code}
from pyspark.mllib.linalg import DenseVector
rdd = sc.parallelize(
    [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)])
rdd.groupByKey().count()
{code}
returns 2.
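As a workaround, one could convert each vector to a tuple before grouping, since Python tuples hash by value rather than by identity. A minimal plain-Python sketch of why that works (grouping done with a dict here, standing in for {{groupByKey}}):

{code}
from collections import defaultdict

# Two equal tuple keys collapse into one group, unlike two
# distinct DenseVector objects with identity-based hashes.
pairs = [((1.0, 2.0, 3.0), 10), ((1.0, 2.0, 3.0), 20)]

groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

print(len(groups))  # 1 group, matching the Scala result
{code}

In PySpark this would correspond to something like {{rdd.map(lambda kv: (tuple(kv[0]), kv[1])).groupByKey()}} (untested here), at the cost of losing the vector type on the key.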
Since the underlying {{numpy.ndarray}} can be used to mutate a {{DenseVector}}, hashing doesn't look meaningful at all:
{code}
>>> dv = DenseVector([1, 2, 3])
>>> hdv1 = hash(dv)
>>> dv.array[0] = 3.0
>>> hdv2 = hash(dv)
>>> hdv1 == hdv2
True
>>> dv == DenseVector([1, 2, 3])
False
{code}
In my opinion the best approach would be to enforce immutability and provide meaningful hashing. An alternative is to make {{DenseVector}} unhashable, the same as {{numpy.ndarray}}.
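A minimal sketch of the first option, in plain Python (the class name and layout are hypothetical, not the MLlib API): store the values in an immutable tuple and derive both equality and hash from them, so the {{hash}}/{{==}} contract holds and the hash can never be invalidated by mutation.

{code}
class FrozenDenseVector:
    """Sketch of an immutable dense vector with value-based hashing."""
    __slots__ = ("_values",)

    def __init__(self, values):
        # An immutable tuple guarantees the hash stays valid forever.
        self._values = tuple(float(v) for v in values)

    def __eq__(self, other):
        return (isinstance(other, FrozenDenseVector)
                and self._values == other._values)

    def __hash__(self):
        return hash(self._values)

a = FrozenDenseVector([1, 2, 3])
b = FrozenDenseVector([1, 2, 3])
print(a == b, hash(a) == hash(b))  # True True
{code}

With such a definition, equal vectors would land in the same partition and group under {{groupByKey}}, matching the Scala behaviour.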
Source: http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org