
Posted to issues@spark.apache.org by "Jarno Seppanen (JIRA)" <ji...@apache.org> on 2016/08/24 07:53:20 UTC

[jira] [Created] (SPARK-17211) Broadcast join produces incorrect results

Jarno Seppanen created SPARK-17211:
--------------------------------------

             Summary: Broadcast join produces incorrect results
                 Key: SPARK-17211
                 URL: https://issues.apache.org/jira/browse/SPARK-17211
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
            Reporter: Jarno Seppanen


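A broadcast join produces incorrect columns in the join result; see the example below. In the incorrect output, the value column repeats the key_id values instead of the values from keys_df. The same join without the broadcast hint returns the correct columns.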

Running PySpark on YARN on Amazon EMR 5.0.0.

{noformat}

import pyspark.sql.functions as func

keys = [
    (54000000, 0),
    (54000001, 1),
    (54000002, 2),
]

keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
keys_df.show()
# +--------+-----+
# |  key_id|value|
# +--------+-----+
# |54000000|    0|
# |54000001|    1|
# |54000002|    2|
# +--------+-----+

data = [
    (54000002,    1),
    (54000000,    2),
    (54000001,    3),
]

data_df = spark.createDataFrame(data, ['key_id', 'foo'])
data_df.show()
# +--------+---+
# |  key_id|foo|
# +--------+---+
# |54000002|  1|
# |54000000|  2|
# |54000001|  3|
# +--------+---+

### INCORRECT ###

data_df.join(func.broadcast(keys_df), 'key_id').show()
# +--------+---+--------+
# |  key_id|foo|   value|
# +--------+---+--------+
# |54000002|  1|54000002|
# |54000000|  2|54000000|
# |54000001|  3|54000001|
# +--------+---+--------+

### CORRECT ###

data_df.join(keys_df, 'key_id').show()
# +--------+---+-----+
# |  key_id|foo|value|
# +--------+---+-----+
# |54000000|  2|    0|
# |54000001|  3|    1|
# |54000002|  1|    2|
# +--------+---+-----+
{noformat}
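
For anyone trying to reproduce this, the expected rows can be computed without Spark and compared against whatever a join returns. The following is only a sketch: reference_join and sorted_rows are hypothetical helper names that are not part of the report or of PySpark, and the sketch assumes key_id is unique in the keys list, as it is above.

```python
# Inputs copied from the report above.
keys = [(54000000, 0), (54000001, 1), (54000002, 2)]
data = [(54000002, 1), (54000000, 2), (54000001, 3)]


def reference_join(data_rows, key_rows):
    """Inner-join (key_id, foo) rows with (key_id, value) rows in plain Python.

    Assumes key_id is unique in key_rows, as it is in the report.
    """
    lookup = dict(key_rows)
    return sorted(
        (key_id, foo, lookup[key_id])
        for key_id, foo in data_rows
        if key_id in lookup
    )


def sorted_rows(df):
    """Collect a DataFrame-like object into a sorted list of plain tuples.

    Hypothetical helper: it only relies on df.collect() returning rows
    that can be converted with tuple(), which pyspark.sql.Row supports.
    """
    return sorted(tuple(row) for row in df.collect())


expected = reference_join(data, keys)
print(expected)

# With the report's session and DataFrames in scope, the check would be:
#   joined = data_df.join(func.broadcast(keys_df), 'key_id')
#   assert sorted_rows(joined) == expected
```

Sorting both sides makes the comparison independent of row order, which differs between the two outputs shown in the report.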




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org