Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/05/31 07:46:17 UTC

[jira] [Updated] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column

     [ https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-7197:
-------------------------------
    Description: 
It looks like join with the DataFrame API in Python does not return correct results when using 2 or more columns.  The example in the documentation
only shows a single column.

Here is an example to show the problem:

****Example code****
{code}
import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'year': ['1993', '2005', '1994'],
                  'month': ['5', '12', '12'],
                  'value': [100, 200, 300]})
a = hc.createDataFrame(A)
B = pd.DataFrame({'year': ['1993', '1993'],
                  'month': ['12', '12'],
                  'value': [101, 102]})
b = hc.createDataFrame(B)

print "Pandas"  # try with Pandas
print A
print B
print pd.merge(A, B, on=['year', 'month'], how='inner')

print "Spark"
print a.toPandas()
print b.toPandas()
print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
{code}
****Output****
{code}
Pandas
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

Empty DataFrame
Columns: [month, value_x, year, value_y]
Index: []

Spark
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

 month  value  year month  value  year
0    12    200  2005    12    102  1993
1    12    200  2005    12    101  1993
2    12    300  1994    12    102  1993
3    12    300  1994    12    101  1993
{code}

It looks like Spark returns rows where an inner join should return nothing.

Confirmed as an issue with Ayan Guha on the user mailing list.
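A possible explanation (an assumption on my part, not confirmed in this report): Python's {{and}} cannot be overloaded, and for two truthy operands {{x and y}} simply evaluates to {{y}}. If Column objects are truthy, the condition {{a.year==b.year and a.month==b.month}} would silently collapse to the month comparison alone, which is consistent with the month-only matches in the Spark output above. A minimal pure-Python sketch, using a hypothetical {{Col}} stand-in rather than the real Column class:

{code}
# Assumption: `and` always returns one of its operands, so the year condition
# is silently dropped. `&` (__and__) is overloadable and keeps both.
class Col(object):
    """Hypothetical stand-in for a Column whose operators build expressions."""
    def __init__(self, expr):
        self.expr = expr
    def __eq__(self, other):
        return Col("(%s = %s)" % (self.expr, other.expr))
    def __and__(self, other):
        return Col("(%s AND %s)" % (self.expr, other.expr))

year_cond = Col("a.year") == Col("b.year")
month_cond = Col("a.month") == Col("b.month")

bad = year_cond and month_cond   # `and` returns the 2nd operand: month only
good = year_cond & month_cond    # `&` combines both conditions

print(bad.expr)    # (a.month = b.month)
print(good.expr)   # ((a.year = b.year) AND (a.month = b.month))
{code}

If that is the cause, writing the condition as {{(a.year == b.year) & (a.month == b.month)}} should work, since Column does overload {{&}} (note the parentheses: {{&}} binds tighter than {{==}}).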

  was:
It looks like join with DataFrames API in python does not return correct results if using more 2 or more columns.  The example in the documentation
only shows a single column.

Here is an example to show the problem:

****Example code****

import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
'12', '12'], 'value': [100, 200, 300]})
a = hc.createDataFrame(A)
B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
'value': [101, 102]})
b = hc.createDataFrame(B)

print "Pandas"  # try with Pandas
print A
print B
print pd.merge(A, B, on=['year', 'month'], how='inner')

print "Spark"
print a.toPandas()
print b.toPandas()
print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()

*****Output****

Pandas
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

Empty DataFrame

Columns: [month, value_x, year, value_y]

Index: []

Spark
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

 month  value  year month  value  year
0    12    200  2005    12    102  1993
1    12    200  2005    12    101  1993
2    12    300  1994    12    102  1993
3    12    300  1994    12    101  1993

It looks like Spark returns some results where an inner join should
return nothing.

Confirmed on user mailing list as an issue with Ayan Guha.


> Join with DataFrame Python API not working properly with more than 1 column
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-7197
>                 URL: https://issues.apache.org/jira/browse/SPARK-7197
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.3.1
>            Reporter: Ali Bajwa
>
> It looks like join with the DataFrame API in Python does not return correct results when using 2 or more columns.  The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> ****Example code****
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> ****Output****
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
>  month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
> {code}
> It looks like Spark returns some results where an inner join should
> return nothing.
> Confirmed on user mailing list as an issue with Ayan Guha.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org