Posted to issues@spark.apache.org by "dciborow (JIRA)" <ji...@apache.org> on 2018/01/28 22:59:00 UTC

[jira] [Commented] (SPARK-17949) Introduce a JVM object based aggregate operator

    [ https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342766#comment-16342766 ] 

dciborow commented on SPARK-17949:
----------------------------------

I am seeing the message below frequently in my logs. Having read through this issue, I generally understand the feature and what is happening. Are there any particular actions I should take when I see this a lot?

I see three actions I could take:

A) Disable the object hash aggregate feature

B) Decrease the fallback threshold

C) Increase the fallback threshold

 

Before I go trying things at random and testing performance, I would appreciate a better understanding of how to handle this situation.

LOG: 

18/01/28 22:56:26 INFO ObjectAggregationIterator: Aggregation hash map reaches threshold capacity (128 entries), spilling and falling back to sort based aggregation. You may change the threshold by adjust option spark.sql.objectHashAggregate.sortBased.fallbackThreshold
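For reference, both knobs mentioned above can be set as ordinary Spark SQL configs. A sketch of options B/C and A, assuming spark-submit (the threshold value 4096 and the application jar name are illustrative placeholders; the default threshold is the 128 shown in the log):

```shell
# Option B/C: raise (or lower) the threshold at which the object hash
# aggregate spills and falls back to sort-based aggregation.
spark-submit \
  --conf spark.sql.objectHashAggregate.sortBased.fallbackThreshold=4096 \
  my-app.jar

# Option A: disable the object hash aggregate operator entirely, so the
# planner uses sort-based aggregation from the start.
spark-submit \
  --conf spark.sql.execution.useObjectHashAggregateExec=false \
  my-app.jar
```

The same keys can also be set per session via spark.conf.set(...) before running the aggregation.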

> Introduce a JVM object based aggregate operator
> -----------------------------------------------
>
>                 Key: SPARK-17949
>                 URL: https://issues.apache.org/jira/browse/SPARK-17949
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Cheng Lian
>            Priority: Major
>              Labels: releasenotes
>             Fix For: 2.2.0
>
>         Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed for simple data types. It does, however, suffer from the following:
> # For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is fairly expensive to fit into the Tungsten internal format.
> # For aggregate functions that require complex intermediate data structures, Unsafe (on raw bytes) is not a good programming abstraction due to the lack of structs.
> The idea here is to introduce a JVM object based hash aggregate operator that can support the aforementioned use cases. This operator, however, should limit its memory usage to avoid putting too much pressure on GC, e.g. falling back to sort-based aggregate as soon as the number of objects exceeds a very low threshold.
> Internally at Databricks we prototyped a version of this for a customer POC and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org