Posted to issues@spark.apache.org by "peay (JIRA)" <ji...@apache.org> on 2016/11/16 20:07:58 UTC

[jira] [Created] (SPARK-18473) Correctness issue in INNER join result with window functions

peay created SPARK-18473:
----------------------------

             Summary: Correctness issue in INNER join result with window functions
                 Key: SPARK-18473
                 URL: https://issues.apache.org/jira/browse/SPARK-18473
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core, SQL
    Affects Versions: 2.0.1
            Reporter: peay


I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how two rows now have a sessId of 2 (instead of one row, as expected), and how `fiftyCount` can now be negative even though it is always zero in the original dataframe.

The first dataframe is built with two windowed sums over the same window (the tables above use the short names `oneCount` and `fiftyCount`; the script below calls them `numOnesBefore` and `numFiftyStrictlyBefore`), as sketched right after this list:
- `oneCount` sums `hasOne` over `window.rowsBetween(-10, 0)`.
- `fiftyCount` sums `hasFifty` over `window.rowsBetween(-10, -1)`.
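
For reference, here is a minimal sketch of the two frame definitions (column names as in the reproduction script at the end):

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.window import Window

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

# Frame includes the current row: defined for every row.
num_ones_before = F.sum("hasOne").over(window.rowsBetween(-10, 0))

# Frame excludes the current row: empty on the first row, so the sum is null there.
num_fifty_strictly_before = F.sum("hasFifty").over(window.rowsBetween(-10, -1))
{code}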

The result is **correct** if any one of the following changes is made (sketches of some of these workarounds follow the list):
- `fiftyCount` is computed over `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`.
- I add `.fillna({'numOnesBefore': 0})`, even though this has no visible effect on the dataframe output by `show`, as far as I can tell.
- I use a LEFT OUTER join instead of an INNER join.
- I write both dataframes to Parquet, read them back, and join those.
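
Concretely, the `fillna` and LEFT OUTER join workarounds amount to roughly the following (a sketch reusing the names from the script below):

{code:python}
# Workaround: also fill numOnesBefore, even though it never shows up as null.
df2_fixed = df2.fillna({"numOnesBefore": 0})
df_selector.join(df2_fixed, "sessId", how="inner").show()

# Workaround: a LEFT OUTER join instead of an INNER join.
df_selector.join(df2, "sessId", how="left_outer").show()
{code}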

This can be reproduced in PySpark using:

{code:python}
import pandas as pd

import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window
from pyspark.sql import SparkSession

# A SparkSession stands in for the SQLContext the original snippet assumed.
sql_context = SparkSession.builder.getOrCreate()

df1 = sql_context.createDataFrame(
    pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

# A single global partition, ordered by (timeStamp, index).
window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
    df1
    # Indicator columns: index == 1 and index == 50.
    .withColumn("hasOne", (col("index") == 1).cast("int"))
    .withColumn("hasFifty", (col("index") == 50).cast("int"))
    # Windowed sum including the current row.
    .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
    # Windowed sum excluding the current row: null on the first row, hence the fillna.
    .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
    .fillna({"numFiftyStrictlyBefore": 0})
    .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}
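
And the Parquet round-trip workaround would look roughly like this (a sketch; the paths are placeholders):

{code:python}
# Workaround: materialize both sides before joining.
df2.write.mode("overwrite").parquet("/tmp/df2.parquet")
df_selector.write.mode("overwrite").parquet("/tmp/selector.parquet")

df2_rt = sql_context.read.parquet("/tmp/df2.parquet")
selector_rt = sql_context.read.parquet("/tmp/selector.parquet")
selector_rt.join(df2_rt, "sessId", how="inner").show()
{code}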


