Posted to issues@spark.apache.org by "Lovasoa (JIRA)" <ji...@apache.org> on 2017/05/31 20:24:04 UTC
[jira] [Updated] (SPARK-20939) Do not duplicate user-defined functions while optimizing logical query plans
[ https://issues.apache.org/jira/browse/SPARK-20939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lovasoa updated SPARK-20939:
----------------------------
Description:
Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that
{{
Filter UDF(a)
+- Join Inner, (a = b)
   :- Relation
   +- Relation
}}
becomes
{{
Join Inner, (a = b)
:- Filter UDF(a)
:  +- Relation
+- Filter UDF(b)
   +- Relation
}}
In general, pushing filters down is a good thing, as it reduces the number of records that go through the join.
However, when the filter is a user-defined function (UDF), we cannot know whether the cost of executing the function twice (once on each side of the join) outweighs the cost of joining more records.
So I think the optimizer shouldn't duplicate user-defined functions in the query plan tree. Users can still duplicate the filter themselves if they want to.
See this question on Stack Overflow: https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark
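The tradeoff can be illustrated with a back-of-the-envelope cost model. This is a hypothetical sketch, not Spark code: the function names and the selectivity numbers are made up for illustration, and it only counts UDF invocations in each plan shape.

```python
# Hypothetical cost sketch (not Spark APIs): count how many times the
# UDF filter runs depending on where the optimizer places it.

def udf_calls_filter_above_join(left_rows, right_rows, join_selectivity):
    """Filter kept above the join: the UDF runs once per joined row."""
    return int(left_rows * right_rows * join_selectivity)

def udf_calls_filter_pushed_down(left_rows, right_rows):
    """Filter duplicated below the join: the UDF runs once per input
    row on each side of the join."""
    return left_rows + right_rows

# A very selective join: 10,000 rows on each side, ~100 matching pairs.
above = udf_calls_filter_above_join(10_000, 10_000, 1e-6)
pushed = udf_calls_filter_pushed_down(10_000, 10_000)
print(above, pushed)  # 100 vs 20000 UDF invocations
```

With an expensive UDF and a selective join, pushing the filter down multiplies the number of UDF invocations; with a cheap UDF and a join that produces many rows, pushdown wins. The optimizer cannot tell which case applies without knowing the UDF's cost, which is the point of this issue.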
> Do not duplicate user-defined functions while optimizing logical query plans
> ----------------------------------------------------------------------------
>
> Key: SPARK-20939
> URL: https://issues.apache.org/jira/browse/SPARK-20939
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, SQL
> Affects Versions: 2.1.0
> Reporter: Lovasoa
> Priority: Minor
> Labels: logical_plan, optimizer
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org