Posted to issues@spark.apache.org by "Michael Patterson (JIRA)" <ji...@apache.org> on 2016/10/19 19:04:58 UTC

[jira] [Created] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

Michael Patterson created SPARK-18014:
-----------------------------------------

             Summary: Filters are incorrectly being grouped together when there is processing in between
                 Key: SPARK-18014
                 URL: https://issues.apache.org/jira/browse/SPARK-18014
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2.0.1
         Environment: Pyspark 2.0.1, Ipython 4.2
            Reporter: Michael Patterson
            Priority: Minor


I created a dataframe that needed to filter on columnA, create a new columnB by applying a user-defined function to columnA, and then filter on columnB. However, in the optimized plan the two filters are combined and placed below the Project produced by the withColumn statement, so the user-defined function in the second filter is evaluated on rows the first filter should have removed, causing errors due to unexpected input to the function.

Example code to reproduce:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial

data = [{'input': 0}, {'input': 1}, {'input': 2}]
input_df = sc.parallelize(data).toDF()

# my_dict has no entry for 0, so applying the UDF to input == 0 raises KeyError
my_dict = {1: 'first', 2: 'second'}

def apply_dict(input_dict, value):
    return input_dict[value]

test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())

# filter out 0 first, add the derived column, then filter on the new column
test_df = input_df.filter('input > 0') \
                  .withColumn('output', test_udf('input')) \
                  .filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields the following, because the combined filter applies the UDF to input 0, which is not a key in my_dict:
KeyError: 0

Executing the above code in pyspark 1.6.2 yields the expected result:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+
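
A possible workaround (a hedged sketch, not part of the original report; it reuses input_df from the example above, and apply_dict_safe / safe_udf are illustrative names): make the UDF total over its input so that whichever evaluation order the optimizer chooses cannot raise. dict.get returns None for missing keys, which becomes a SQL null and simply fails the rlike predicate:

```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial

my_dict = {1: 'first', 2: 'second'}

def apply_dict_safe(input_dict, value):
    # dict.get returns None instead of raising KeyError for missing keys;
    # the resulting SQL null fails the rlike('^s') predicate instead of erroring
    return input_dict.get(value)

safe_udf = F.udf(partial(apply_dict_safe, my_dict), T.StringType())

test_df = input_df.filter('input > 0') \
                  .withColumn('output', safe_udf('input')) \
                  .filter(F.col('output').rlike('^s'))
```

This sidesteps the combined filter rather than preventing it: the optimized plan may still evaluate the UDF on rows the first filter should have removed, so the underlying optimizer behavior reported above is unchanged.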


