You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/07/08 18:40:59 UTC

[GitHub] [spark] erikerlandson edited a comment on issue #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

erikerlandson edited a comment on issue #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#issuecomment-509339589
 
 
   @rxin the key difference is in the `update` methods. The standard UDAF requires that the aggregator be stored in a MutableAggregationBuffer and so a UDAF update method always has this basic form:
   ```scala
   def update(buf: MutableAggregationBuffer, input: Row): Unit = {
     val agg = buf.getAs[AggregatorType](0)  // UDT deserializes the aggregator from 'buf'
     agg.update(input)    // update the state of your aggregation
     buf(0) = agg    // UDT re-serializes the aggregator back into buf
   }
   ```
   The consequence of this is that it is calling deserialize and (re)serialize for the actual aggregating structure for every single input row. If your dataframe has a million rows, it's doing ser/de on your aggregator a million times, not just at the end of each data partition.
   
   Compare that with the UDAI (which is driven by TypedImperativeAggregate)
   ```scala
   def update(agg: AggregatorType, input: Row): AggregatorType = {
     agg.update(input) // update the state of your aggregator from the input
     agg // return the aggregator
   }
   ```
   You can see that here, there is no ser/de of the aggregator at all, when processing input rows (which is as it should be).  The TypedImperativeAggregate only invokes ser/de on the aggregator when it is collecting partial results across partitions (and at the end when it is presenting final results into the output data frame).
   
   So, imagine a data-frame with 10 partitions and 1 million rows.  The UDAF does ser/de on the aggregator a million (plus 10) times, while the UDIA does ser/de only 10 times.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org