Posted to issues@spark.apache.org by "Matei Zaharia (JIRA)" <ji...@apache.org> on 2014/06/03 22:34:07 UTC

[jira] [Resolved] (SPARK-1468) The hash method used by partitionBy in Pyspark doesn't deal with None correctly.

     [ https://issues.apache.org/jira/browse/SPARK-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia resolved SPARK-1468.
----------------------------------

    Resolution: Fixed

> The hash method used by partitionBy in Pyspark doesn't deal with None correctly.
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-1468
>                 URL: https://issues.apache.org/jira/browse/SPARK-1468
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.0
>            Reporter: Erik Selin
>            Assignee: Erik Selin
>             Fix For: 0.9.2, 1.0.1
>
>
> In Python, the default hash method is derived from an object's memory address. Since None is an object, it gets assigned to different partitions depending on which Python process hashes it. This causes some really odd results when None keys are used in partitionBy.
> I've created a fix using a consistent hashing method that sends None to 0. The PR lives at https://github.com/apache/spark/pull/371
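
The idea in the description can be illustrated with a minimal sketch. This is not the code from the pull request: the names consistent_hash and partition_for are hypothetical, and the sketch assumes only what the description states, namely that None is sent to 0 and that a key's partition is chosen from its hash.

```python
def consistent_hash(key):
    """Hypothetical helper: behaves like hash(), but maps None to a fixed value.

    Assumption taken from the issue description: None is sent to 0, so every
    Python worker process agrees on where a None key belongs.
    """
    if key is None:
        return 0
    # All other keys fall through to the built-in hash().
    return hash(key)


def partition_for(key, num_partitions):
    """Hypothetical helper: choose a partition index from the hash of the key."""
    # Python's % returns a non-negative result for a positive divisor,
    # so negative hash values still map into range(num_partitions).
    return consistent_hash(key) % num_partitions


if __name__ == "__main__":
    # A None key lands in partition 0 no matter which process evaluates this.
    print(partition_for(None, 8))
```

With a helper like this, two worker processes that both see a None key route it to the same partition, which is the property the reported bug violated.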



--
This message was sent by Atlassian JIRA
(v6.2#6252)