Posted to issues@spark.apache.org by "Ali Bajwa (JIRA)" <ji...@apache.org> on 2015/04/28 18:46:06 UTC
[jira] [Created] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column
Ali Bajwa created SPARK-7197:
--------------------------------
Summary: Join with DataFrame Python API not working properly with more than 1 column
Key: SPARK-7197
URL: https://issues.apache.org/jira/browse/SPARK-7197
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 1.3.1
Reporter: Ali Bajwa
It looks like join with the DataFrames API in Python does not return correct results when joining on 2 or more columns. The example in the documentation only shows a single column.
Here is an example to show the problem:
****Example code****
import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5', '12', '12'], 'value': [100, 200, 300]})
a = hc.createDataFrame(A)
B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'], 'value': [101, 102]})
b = hc.createDataFrame(B)
print "Pandas" # try with Pandas
print A
print B
print pd.merge(A, B, on=['year', 'month'], how='inner')
print "Spark"
print a.toPandas()
print b.toPandas()
print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
****Output****
Pandas
month value year
0 5 100 1993
1 12 200 2005
2 12 300 1994
month value year
0 12 101 1993
1 12 102 1993
Empty DataFrame
Columns: [month, value_x, year, value_y]
Index: []
Spark
month value year
0 5 100 1993
1 12 200 2005
2 12 300 1994
month value year
0 12 101 1993
1 12 102 1993
month value year month value year
0 12 200 2005 12 102 1993
1 12 200 2005 12 101 1993
2 12 300 1994 12 102 1993
3 12 300 1994 12 101 1993
It looks like Spark returns some results where an inner join should return nothing.
Confirmed with Ayan Guha on the user mailing list as an issue.
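A likely explanation (my reading, not stated in the report): Python's `and` cannot be overloaded, so in `a.year==b.year and a.month==b.month` the interpreter takes the truthiness of the first Column expression (truthy by default) and the whole condition silently collapses to just `a.month==b.month` - which matches the output above, where only month is equal across the joined rows. The overloadable bitwise `&` operator keeps both conditions. The sketch below illustrates this with a minimal, hypothetical stand-in for Spark's Column class (not the real PySpark implementation):

```python
class Column:
    """Minimal stand-in for a Spark SQL Column (hypothetical, for illustration)."""
    def __init__(self, expr):
        self.expr = expr
    def __eq__(self, other):
        # Like Spark, == builds an expression object instead of returning a bool
        return Column("(%s = %s)" % (self.expr, other.expr))
    def __and__(self, other):
        # & is overloadable, so it can combine two Column expressions
        return Column("(%s AND %s)" % (self.expr, other.expr))
    def __repr__(self):
        return self.expr

year_cond = Column("a.year") == Column("b.year")
month_cond = Column("a.month") == Column("b.month")

# `and` evaluates bool(year_cond); the object is truthy, so the
# expression evaluates to the right operand, discarding year_cond.
broken = year_cond and month_cond
print(broken)        # (a.month = b.month)

# `&` calls __and__ and keeps both conditions.
correct = year_cond & month_cond
print(correct)       # ((a.year = b.year) AND (a.month = b.month))
```

If this reading is right, the workaround for the example above would be `a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')`, with parentheses because `&` binds tighter than `==`.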
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)