You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2015/08/03 22:03:04 UTC

[jira] [Commented] (SPARK-9357) Remove JoinedRow

    [ https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652390#comment-14652390 ] 

Herman van Hovell commented on SPARK-9357:
------------------------------------------

+1 for removing this.

The {{AlgebraicAggregate}} part of the new UDAF interfaces uses this a lot in very performance critical sections. I ran into this doing some benchmarking for the SPARK-8641 ticket. After some profiling I found out that a significant amount of time is spent in JoinedRow; It causes a major performance regression in some cases.

I have been doing some experimentation:
* After a discussion with [~yhuai] I tried removing the branching in the joined row. This improved the situation by 10%.
* I created specialized {{BoundReference}}'s which bind directly to the {{row1}} or {{row2}} values of the JoinedRow (exposed those through getters). This worked wonders. Performance is much better in most cases now. In the end I think it is best to explicitly start to support a {{JoinProjection}} which takes a left and a right row as input and produces an output row, and change {{SparkPlan}} and CG accordingly. I think we'd still need JoinedRow in the interpreted case though. 

I can turn the POC I have made into a PR for discussion.

> Remove JoinedRow
> ----------------
>
>                 Key: SPARK-9357
>                 URL: https://issues.apache.org/jira/browse/SPARK-9357
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>            Reporter: Reynold Xin
>
> JoinedRow was introduced to join two rows together, in aggregation (join key and value), joins (left, right), window functions, etc.
> It aims to reduce the amount of data copied, but incurs branches when the row is actually read. Given all the fields will be read almost all the time (otherwise they get pruned out by the optimizer), branch predictor cannot do anything about those branches.
> I think a better way is just to remove this thing, and materializes the row data directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org