Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2019/01/04 17:26:00 UTC

[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression

     [ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-26366:
--------------------------------
    Comment: was deleted

(was: mgaido91 opened a new pull request #23372: [SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False
URL: https://github.com/apache/spark/pull/23372
 
 
   ## What changes were proposed in this pull request?
   
   In `ReplaceExceptWithFilter` we do not properly handle the case in which the condition evaluates to NULL. Since negating NULL still returns NULL, the assumption that negating the condition returns all the rows which did not satisfy it does not hold: rows for which the condition evaluates to NULL may not be returned. This happens when the constraints inferred by `InferFiltersFromConstraints` are not enough, as with `OR` conditions.
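
   The problem can be illustrated outside of Spark with a hand-rolled sketch of SQL's three-valued logic (this is not Catalyst code; `None` stands in for SQL NULL, and a SQL filter keeps a row only when its condition is TRUE):

```python
def sql_not(v):
    # SQL three-valued logic: NOT NULL is NULL
    return None if v is None else (not v)

# Condition values for three hypothetical rows
conds = [True, False, None]

kept_by_filter  = [c for c in conds if c is True]           # filter(cond)
kept_by_negated = [c for c in conds if sql_not(c) is True]  # filter(NOT cond)

# The NULL row is kept by neither branch, so rewriting
# a.except(b) as a.filter(NOT cond) silently drops it.
print(kept_by_filter)   # [True]
print(kept_by_negated)  # [False]
```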
   
   The rule also had problems with non-deterministic conditions: in such a scenario, it would change the probability of the output.
   
   The PR fixes these problems by:
    - returning FALSE for the condition when it is NULL (in this way we do return all the rows which did not satisfy it);
    - avoiding any transformation when the condition is non-deterministic.
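
   The first fix can be sketched in the same hand-rolled model of three-valued logic (again an illustration, not the actual Catalyst rule): coalescing a NULL condition to FALSE before negation makes `NOT(condition)` TRUE for NULL rows, so the rewritten filter returns them.

```python
def sql_not(v):
    # SQL three-valued logic: NOT NULL is NULL
    return None if v is None else (not v)

def coalesce(v, default):
    # SQL COALESCE: replace NULL with the given default
    return default if v is None else v

conds = [True, False, None]

# filter(NOT coalesce(cond, FALSE)): the NULL row now survives,
# because NOT(coalesce(NULL, FALSE)) = NOT(FALSE) = TRUE.
kept = [c for c in conds if sql_not(coalesce(c, False)) is True]
print(kept)  # [False, None]
```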
   
   ## How was this patch tested?
   
   added UTs
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
)

> Except with transform regression
> --------------------------------
>
>                 Key: SPARK-26366
>                 URL: https://issues.apache.org/jira/browse/SPARK-26366
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.2
>            Reporter: Dan Osipov
>            Assignee: Marco Gaido
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> There appears to be a regression between Spark 2.2 and 2.3. Below is the code to reproduce it:
>  
> {code:java}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val inputDF = spark.sqlContext.createDataFrame(
>   spark.sparkContext.parallelize(Seq(
>     Row("0", "john", "smith", "john@smith.com"),
>     Row("1", "jane", "doe", "jane@doe.com"),
>     Row("2", "apache", "spark", "spark@apache.org"),
>     Row("3", "foo", "bar", null)
>   )),
>   StructType(List(
>     StructField("id", StringType, nullable=true),
>     StructField("first_name", StringType, nullable=true),
>     StructField("last_name", StringType, nullable=true),
>     StructField("email", StringType, nullable=true)
>   ))
> )
> val exceptDF = inputDF.transform( toProcessDF =>
>   toProcessDF.filter(
>       (
>         col("first_name").isin(Seq("john", "jane"): _*)
>           and col("last_name").isin(Seq("smith", "doe"): _*)
>       )
>       or col("email").isin(List(): _*)
>   )
> )
> inputDF.except(exceptDF).show()
> {code}
> Output with Spark 2.2:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name|           email|
> +---+----------+---------+----------------+
> |  2|    apache|    spark|spark@apache.org|
> |  3|       foo|      bar|            null|
> +---+----------+---------+----------------+{noformat}
> Output with Spark 2.3:
> {noformat}
> +---+----------+---------+----------------+
> | id|first_name|last_name|           email|
> +---+----------+---------+----------------+
> |  2|    apache|    spark|spark@apache.org|
> +---+----------+---------+----------------+{noformat}
> Note: changing the last line to
> {code:java}
> inputDF.except(exceptDF.cache()).show()
> {code}
> produces identical output for both Spark 2.2 and 2.3.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org