Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/09/16 10:05:45 UTC

[jira] [Updated] (SPARK-4273) Providing ExternalSet to avoid OOM when count(distinct)

     [ https://issues.apache.org/jira/browse/SPARK-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-4273:
-----------------------------
    Assignee: Michael Armbrust

> Providing ExternalSet to avoid OOM when count(distinct)
> -------------------------------------------------------
>
>                 Key: SPARK-4273
>                 URL: https://issues.apache.org/jira/browse/SPARK-4273
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>            Reporter: YanTang Zhai
>            Assignee: Michael Armbrust
>            Priority: Minor
>             Fix For: 1.5.0
>
>
> Some tasks may OOM when running count(distinct) if they need to process many records. CombineSetsAndCountFunction puts all records into an OpenHashSet; if it fetches many records, that set may occupy a large amount of memory.
> I think a data structure ExternalSet, similar to ExternalAppendOnlyMap, could be provided to spill OpenHashSet data to disk when its size exceeds some threshold.
> For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1 with its elements sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be illustrated as follows:
> ohs1 [d, b, c, a] => [a, b, c, d] => file1
> ohs2 [e, f, g, a] => [a, e, f, g] => file2
> ohs3 [e, h, i, g] => [e, g, h, i] => file3
> ohs4 [j, h, a] => [a, h, j] => sortedSet
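The spill step above can be sketched as follows. This is a minimal illustrative sketch, not Spark's actual implementation; the `spill` helper, the in-memory list standing in for a spill file, and sorting by Python's built-in `hash` are all assumptions made for the example:

```python
# Sketch of the spill step: when an in-memory set grows past a threshold,
# its elements are written out sorted by hash code, so the spill files
# can later be merged like sorted runs.

def spill(ohs, runs):
    """Sort the set's elements by hash code and append them as a new run."""
    run = sorted(ohs, key=hash)
    runs.append(run)  # a real implementation would write the run to disk
    ohs.clear()       # free the in-memory set

runs = []
spill({"d", "b", "c", "a"}, runs)  # plays the role of ohs1 -> file1
spill({"e", "f", "g", "a"}, runs)  # plays the role of ohs2 -> file2
```

Because every run is sorted by the same key, all copies of a given element end up in the same position of a later merge, which is what makes the deduplication pass possible.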
> When output, all keys with the same hashCode are put into one OpenHashSet, and then that OpenHashSet's iterator is traversed. The procedure could be illustrated as follows:
> file1-> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
> file1 -> b -> ohsB; ohsB -> b;
> file1 -> c -> ohsC; ohsC -> c;
> file1 -> d -> ohsD; ohsD -> d;
> file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE-> e;
> ...
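The merge pass above can be sketched in the same illustrative style. Again this is an assumption-laden sketch rather than Spark code: `heapq.merge` stands in for streaming the hash-sorted spill files, and the small per-hash `set` plays the role of ohsA, ohsB, and so on:

```python
import heapq
from itertools import groupby

def merge_distinct(runs):
    """Merge hash-sorted runs, emitting each distinct key exactly once."""
    merged = heapq.merge(*runs, key=hash)        # stream all runs in hash order
    for _, group in groupby(merged, key=hash):
        seen = set()          # small per-hash set (ohsA, ohsB, ... above);
        for key in group:     # it also handles distinct keys that collide
            if key not in seen:
                seen.add(key)
                yield key

# The four sets from the example, each spilled as a hash-sorted run.
runs = [sorted(s, key=hash) for s in
        [{"d", "b", "c", "a"}, {"e", "f", "g", "a"},
         {"e", "h", "i", "g"}, {"j", "h", "a"}]]
distinct = list(merge_distinct(runs))
```

Only one small set per hash value is alive at a time, so memory use stays bounded by the largest hash-collision group rather than by the total number of distinct keys.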
> I think using the ExternalSet could avoid OOM when running count(distinct). Comments are welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org