Posted to issues@spark.apache.org by "Cheng Hao (JIRA)" <ji...@apache.org> on 2015/08/20 07:28:47 UTC

[jira] [Comment Edited] (SPARK-9357) Remove JoinedRow

    [ https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704311#comment-14704311 ] 

Cheng Hao edited comment on SPARK-9357 at 8/20/15 5:28 AM:
-----------------------------------------------------------

JoinedRow does add overhead by introducing a layer of indirection; however, it is a trade-off, because copying the two non-contiguous pieces of memory into one also hurts performance, particularly in the case I listed above, where only a few records are actually needed by the downstream operators (writing to files) after the filtering.
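Just to illustrate the trade-off, here is a minimal sketch; the SimpleRow/ArrayRow/JoinedView names are made up for illustration and are not Spark's actual InternalRow/JoinedRow classes. The joined view is built in O(1) without copying anything, but every field read pays a branch to decide which side owns the ordinal:
{code}
// Hypothetical illustration only, not Spark's InternalRow/JoinedRow API.
trait SimpleRow {
  def numFields: Int
  def getInt(ordinal: Int): Int
}

class ArrayRow(values: Array[Int]) extends SimpleRow {
  def numFields: Int = values.length
  def getInt(ordinal: Int): Int = values(ordinal)
}

// The joined view: construction copies nothing, but every read branches
// on which side the ordinal belongs to.
class JoinedView(left: SimpleRow, right: SimpleRow) extends SimpleRow {
  def numFields: Int = left.numFields + right.numFields
  def getInt(ordinal: Int): Int =
    if (ordinal < left.numFields) left.getInt(ordinal)
    else right.getInt(ordinal - left.numFields)
}

object JoinedViewDemo {
  def main(args: Array[String]): Unit = {
    val joined = new JoinedView(new ArrayRow(Array(1, 2)), new ArrayRow(Array(3)))
    // Ordinal 2 falls past the left side's 2 fields, so it maps to the
    // right row's ordinal 0 and prints 3.
    println(joined.getInt(2))
  }
}
{code}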

An n-ary JoinedRow would be really helpful in cases like sequential joins. For example:
{code} SELECT * FROM a join b on a.key=b.key join c on a.key=c.key and a.col1>b.col1 and b.col2<c.col3 {code}

Since the join keys are exactly the same, the join operators compute their result in the same stage, and the intermediate row (like below) is then sent to the Filter operator to evaluate the predicates (a.col1>b.col1 and b.col2<c.col3):
{noformat}
          JoinedRow(a,b,c)
              /      \
JoinedRow(a, b)    row(c)
        /   \
row(a)    row(b)
{noformat}
It would probably be even more helpful if we could codegen a row layer to support n-ary joined rows; that definitely saves memory copying, since we never copy even a single byte.
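As a rough sketch of the n-ary idea (again hypothetical types, reusing the SimpleRow trait from the sketch above, not an actual Spark implementation), a single view could dispatch over N child rows via precomputed offsets instead of nesting JoinedRow(JoinedRow(a, b), c), still without copying any row data:
{code}
// Hypothetical n-ary joined view; one dispatch layer instead of nested views.
class NaryJoinedView(children: Array[SimpleRow]) extends SimpleRow {
  // Starting ordinal of each child (last entry is the total field count),
  // so ownership lookup is a short scan rather than a chain of indirections.
  private val offsets: Array[Int] = children.scanLeft(0)(_ + _.numFields)

  def numFields: Int = offsets.last

  def getInt(ordinal: Int): Int = {
    var i = children.length - 1
    while (offsets(i) > ordinal) i -= 1   // find the child that owns this ordinal
    children(i).getInt(ordinal - offsets(i))
  }
}
{code}
A codegen'd version could go further and inline the owning-child lookup for each ordinal, but even the offset-table form shows the shape of the access path.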


> Remove JoinedRow
> ----------------
>
>                 Key: SPARK-9357
>                 URL: https://issues.apache.org/jira/browse/SPARK-9357
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>            Reporter: Reynold Xin
>
> JoinedRow was introduced to join two rows together, in aggregation (join key and value), joins (left, right), window functions, etc.
> It aims to reduce the amount of data copied, but incurs branches when the row is actually read. Given that all the fields will be read almost all the time (otherwise they get pruned out by the optimizer), the branch predictor cannot do anything about those branches.
> I think a better way is just to remove this thing, and materialize the row data directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org