Posted to dev@spark.apache.org by Josh Rosen <jo...@databricks.com> on 2017/05/01 17:52:25 UTC

Re: New Optimizer Hint

The issue of UDFs which return structs being evaluated many times when
accessing the returned struct's fields sounds like
https://issues.apache.org/jira/browse/SPARK-17728; that issue mentions a
trick of using *array* and *explode* to prevent project collapsing.
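
For anyone skimming the thread, here is a toy model of the effect in plain
Python. It is not Spark code and does not use any Spark API: the two functions
only mimic what a collapsed projection and a pair of un-collapsed projections
would compute. The struct values ("desktop", "1.0") and the helper names
collapsed, not_collapsed and count_calls are made up for illustration.

```python
# Toy model only: plain Python, not Spark. All values and helper names
# below are hypothetical and exist just to illustrate the thread.
calls = 0


def user_agent_details(user_agent):
    """Stand-in for the struct-returning UDF; counts its invocations."""
    global calls
    calls += 1
    return {"device_form_factor": "desktop", "browser_version": "1.0"}


def collapsed(row):
    # Mimics the collapsed Project: the UDF expression is inlined into each
    # field access, so it runs once per extracted field.
    return {
        "c1": user_agent_details(row["user_agent"])["device_form_factor"],
        "c2": user_agent_details(row["user_agent"])["browser_version"],
    }


def not_collapsed(row):
    # Mimics two separate projections: the inner one computes `ua` once,
    # the outer one only reads fields from the result.
    ua = user_agent_details(row["user_agent"])
    return {"c1": ua["device_form_factor"], "c2": ua["browser_version"]}


def count_calls(project, rows):
    """Apply a projection to every row and report how often the UDF ran."""
    global calls
    calls = 0
    out = [project(row) for row in rows]
    return out, calls


rows = [{"id": 1, "user_agent": "abc"}]
collapsed_out, collapsed_calls = count_calls(collapsed, rows)
separate_out, separate_calls = count_calls(not_collapsed, rows)
print(collapsed_calls, separate_calls)
```

Both versions return the same rows; the only difference is how many times the
stand-in UDF runs per row, which is the cost the proposed hint (or the
array/explode trick) is meant to avoid.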

On Thu, Apr 20, 2017 at 8:55 AM Reynold Xin <rx...@databricks.com> wrote:

> Doesn't common sub expression elimination address this issue as well?
>
> On Thu, Apr 20, 2017 at 6:40 AM Herman van Hövell tot Westerflier <
> hvanhovell@databricks.com> wrote:
>
>> Hi Michael,
>>
>> This sounds like a good idea. Can you open a JIRA to track this?
>>
>> My initial feedback on your proposal would be that you might want to
>> express the no_collapse at the expression level and not at the plan
>> level.
>>
>> HTH
>>
>> On Thu, Apr 20, 2017 at 3:31 PM, Michael Styles <
>> michael.styles@shopify.com> wrote:
>>
>>> Hello,
>>>
>>> I am in the process of putting together a PR that introduces a new hint
>>> called NO_COLLAPSE. This hint is essentially identical to Oracle's NO_MERGE
>>> hint.
>>>
>>> Let me first give an example of why I am proposing this.
>>>
>>> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
>>> df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"]))
>>> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
>>> df2["ua"].browser_version.alias("c2"))
>>> df3.explain(True)
>>>
>>> == Parsed Logical Plan ==
>>> 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS
>>> c2#91]
>>> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>>>    +- LogicalRDD [id#80L, user_agent#81]
>>>
>>> == Analyzed Logical Plan ==
>>> c1: string, c2: string
>>> Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS
>>> c2#91]
>>> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>>>    +- LogicalRDD [id#80L, user_agent#81]
>>>
>>> == Optimized Logical Plan ==
>>> Project [UDF(user_agent#81).device_form_factor AS c1#90,
>>> UDF(user_agent#81).browser_version AS c2#91]
>>> +- LogicalRDD [id#80L, user_agent#81]
>>>
>>> == Physical Plan ==
>>> *Project [UDF(user_agent#81).device_form_factor AS c1#90,
>>> UDF(user_agent#81).browser_version AS c2#91]
>>> +- Scan ExistingRDD[id#80L,user_agent#81]
>>>
>>> user_agent_details is a user-defined function that returns a struct. As
>>> can be seen from the generated query plan, the function is being executed
>>> multiple times, which could lead to performance issues. This is due to the
>>> CollapseProject optimizer rule that collapses adjacent projections.
>>>
>>> I'm proposing a hint that prevents the optimizer from collapsing adjacent
>>> projections. A new function called 'no_collapse' would be introduced for
>>> this purpose. Consider the following example and generated query plan.
>>>
>>> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
>>> df2 = F.no_collapse(df1.withColumn("ua",
>>> user_agent_details(df1["user_agent"])))
>>> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
>>> df2["ua"].browser_version.alias("c2"))
>>> df3.explain(True)
>>>
>>> == Parsed Logical Plan ==
>>> 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS
>>> c2#76]
>>> +- NoCollapseHint
>>>    +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>>>       +- LogicalRDD [id#64L, user_agent#65]
>>>
>>> == Analyzed Logical Plan ==
>>> c1: string, c2: string
>>> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
>>> c2#76]
>>> +- NoCollapseHint
>>>    +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>>>       +- LogicalRDD [id#64L, user_agent#65]
>>>
>>> == Optimized Logical Plan ==
>>> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
>>> c2#76]
>>> +- NoCollapseHint
>>>    +- Project [UDF(user_agent#65) AS ua#69]
>>>       +- LogicalRDD [id#64L, user_agent#65]
>>>
>>> == Physical Plan ==
>>> *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
>>> c2#76]
>>> +- *Project [UDF(user_agent#65) AS ua#69]
>>>    +- Scan ExistingRDD[id#64L,user_agent#65]
>>>
>>> As can be seen from the query plan, the user-defined function is now
>>> evaluated once per row.
>>>
>>> I would like to get some feedback on this proposal.
>>>
>>> Thanks.
>>>
>>>
>>
>>
>> --
>>
>> Herman van Hövell
>>
>> Software Engineer
>>
>> Databricks Inc.
>>
>> hvanhovell@databricks.com
>>
>> +31 6 420 590 27
>>
>> databricks.com
>>
>>
>