You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/08/28 06:31:21 UTC

[GitHub] [spark] rednaxelafx commented on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions

rednaxelafx commented on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions
URL: https://github.com/apache/spark/pull/20965#issuecomment-525603727
 
 
   Thanks for working on this PR, @maropu ! The new generated code looks a lot better!
   I'll go into more detailed review soon. Three quick questions:
   1. although it probably won't change your example generated code, but for known not-null values extracted from locals/CSE state, when you're constructing the split-off functions, could you make sure that you don't pass redundant `isNull` argument into the split-off function? That could save a lot for each not-null value you pass down. It's sort of like how @viirya 's construct consume function works.
   2. Can we be more precise about which aggregate functions are referencing which extracted locals, and only pass relevant locals down into the split-off aggregate function? If you have a lot of different things, e.g. `sum(a), avg(a), sum(b), avg(b), sum(c), avg(c)`, then if we don't make it more precise, we'd be passing a long list of stuff down to each split-off function for no good.
   3. Can we only trigger the actual split after some certain threshold of the total generated code for aggregate functions is hit? i.e. when it's just a couple of short aggregate functions, generate your new code but don't split off into separate functions; when the total size of the generated code for aggregate function go above some threshold, trigger your actual split. The conf for controlling whether or not to split can be put along side this threshold check, something like: `if (confEnabled && totalSize > threshold) splitAggregateBlahBlahBlah(generatedAggregateFuncCodes) else generatedAggregateFuncCodes.mkString("\n")`
   
   Another thing I'd like to point out with the updated PR description is that, under the "generated code in the current master" line, the example generated code doesn't look right: it looks like you're quoting the generated code from the final aggregate instead of the partial aggregate, because the inputs look like aggregate buffer slot values to me. You intended to quote the partial aggregate code, right?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org