Posted to issues@spark.apache.org by "Maciej Szymkiewicz (JIRA)" <ji...@apache.org> on 2015/07/16 14:19:05 UTC
[jira] [Created] (SPARK-9098) Inconsistent Dense Vectors hashing between PySpark and Scala
Maciej Szymkiewicz created SPARK-9098:
-----------------------------------------
Summary: Inconsistent Dense Vectors hashing between PySpark and Scala
Key: SPARK-9098
URL: https://issues.apache.org/jira/browse/SPARK-9098
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark
Affects Versions: 1.4.0, 1.3.1
Reporter: Maciej Szymkiewicz
Priority: Minor
When using Scala, it is possible to group an RDD using a {{DenseVector}} as a key:
{code}
import org.apache.spark.mllib.linalg.Vectors
val rdd = sc.parallelize(
  (Vectors.dense(1, 2, 3), 10) :: (Vectors.dense(1, 2, 3), 20) :: Nil)
rdd.groupByKey.count
{code}
returns 1 as expected.
In PySpark, {{DenseVector.__hash__}} seems to be inherited from {{object}} and is therefore based on the memory address:
{code}
from pyspark.mllib.linalg import DenseVector
rdd = sc.parallelize(
    [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)])
rdd.groupByKey().count()
{code}
returns 2.
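As a workaround, one could convert each vector to a tuple before grouping, since Python tuples hash by value rather than by identity. A minimal plain-Python sketch of why that works (grouping done with a dict here, standing in for {{groupByKey}}):

{code}
from collections import defaultdict

# Two equal tuple keys collapse into one group, unlike two
# distinct DenseVector objects with identity-based hashes.
pairs = [((1.0, 2.0, 3.0), 10), ((1.0, 2.0, 3.0), 20)]

groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

print(len(groups))  # 1 group, matching the Scala result
{code}

In PySpark this would correspond to something like {{rdd.map(lambda kv: (tuple(kv[0]), kv[1])).groupByKey()}} (untested here), at the cost of losing the vector type on the key.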
Since the underlying {{numpy.ndarray}} can be used to mutate a {{DenseVector}}, hashing doesn't look meaningful at all:
{code}
>>> dv = DenseVector([1, 2, 3])
>>> hdv1 = hash(dv)
>>> dv.array[0] = 3.0
>>> hdv2 = hash(dv)
>>> hdv1 == hdv2
True
>>> dv == DenseVector([1, 2, 3])
False
{code}
In my opinion the best approach would be to enforce immutability and provide meaningful hashing. An alternative is to make {{DenseVector}} unhashable, the same as {{numpy.ndarray}}.
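A minimal sketch of the first option, in plain Python (the class name and layout are hypothetical, not the MLlib API): store the values in an immutable tuple and derive both equality and hash from them, so the {{hash}}/{{==}} contract holds and the hash can never be invalidated by mutation.

{code}
class FrozenDenseVector:
    """Sketch of an immutable dense vector with value-based hashing."""
    __slots__ = ("_values",)

    def __init__(self, values):
        # An immutable tuple guarantees the hash stays valid forever.
        self._values = tuple(float(v) for v in values)

    def __eq__(self, other):
        return (isinstance(other, FrozenDenseVector)
                and self._values == other._values)

    def __hash__(self):
        return hash(self._values)

a = FrozenDenseVector([1, 2, 3])
b = FrozenDenseVector([1, 2, 3])
print(a == b, hash(a) == hash(b))  # True True
{code}

With such a definition, equal vectors would land in the same partition and group under {{groupByKey}}, matching the Scala behaviour.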
Source: http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org