Posted to issues@spark.apache.org by "Jarno Seppanen (JIRA)" <ji...@apache.org> on 2016/08/24 07:53:20 UTC
[jira] [Created] (SPARK-17211) Broadcast join produces incorrect results
Jarno Seppanen created SPARK-17211:
--------------------------------------
Summary: Broadcast join produces incorrect results
Key: SPARK-17211
URL: https://issues.apache.org/jira/browse/SPARK-17211
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.0.0
Reporter: Jarno Seppanen
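A broadcast join produces incorrect column values in the join result. In the example below, the value column of the broadcast join contains the key_id values instead of the values from keys_df. The same join without the broadcast hint gives the correct columns.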
Running PySpark on YARN on Amazon EMR 5.0.0.
{noformat}
import pyspark.sql.functions as func
keys = [
(54000000, 0),
(54000001, 1),
(54000002, 2),
]
keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
keys_df.show()
# +--------+-----+
# |  key_id|value|
# +--------+-----+
# |54000000|    0|
# |54000001|    1|
# |54000002|    2|
# +--------+-----+
data = [
(54000002, 1),
(54000000, 2),
(54000001, 3),
]
data_df = spark.createDataFrame(data, ['key_id', 'foo'])
data_df.show()
# +--------+---+
# |  key_id|foo|
# +--------+---+
# |54000002|  1|
# |54000000|  2|
# |54000001|  3|
# +--------+---+
### INCORRECT ###
data_df.join(func.broadcast(keys_df), 'key_id').show()
# +--------+---+--------+
# |  key_id|foo|   value|
# +--------+---+--------+
# |54000002|  1|54000002|
# |54000000|  2|54000000|
# |54000001|  3|54000001|
# +--------+---+--------+
### CORRECT ###
data_df.join(keys_df, 'key_id').show()
# +--------+---+-----+
# |  key_id|foo|value|
# +--------+---+-----+
# |54000000|  2|    0|
# |54000001|  3|    1|
# |54000002|  1|    2|
# +--------+---+-----+
{noformat}
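For reference, the expected result of the inner join on key_id can be worked out in plain Python from the same keys and data lists, without Spark. This is only a minimal sketch for checking the output by hand: the helper name expected_join is hypothetical and not part of any Spark API, and it assumes key_id is unique in keys, as it is in this example.

```python
# Plain-Python reference for the expected inner join on key_id.
# `expected_join` is a hypothetical helper for illustration, not a Spark API.

def expected_join(data, keys):
    """Return sorted (key_id, foo, value) rows for an inner join on key_id."""
    # Map key_id -> value; assumes key_id is unique in `keys`.
    value_by_key = dict(keys)
    return sorted(
        (key_id, foo, value_by_key[key_id])
        for key_id, foo in data
        if key_id in value_by_key
    )

# Same input rows as in the report above.
keys = [(54000000, 0), (54000001, 1), (54000002, 2)]
data = [(54000002, 1), (54000000, 2), (54000001, 3)]

expected = expected_join(data, keys)
for row in expected:
    print(row)
```

The rows printed here match the non-broadcast join output above. Rows collected from either Spark join could presumably be compared against expected after converting them to sorted tuples, for instance with sorted(tuple(r) for r in df.collect()), where df is a placeholder for the joined DataFrame.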