Posted to issues@spark.apache.org by "YanTang Zhai (JIRA)" <ji...@apache.org> on 2014/11/06 14:37:33 UTC

[jira] [Created] (SPARK-4273) Providing ExternalSet to avoid OOM when count(distinct)

YanTang Zhai created SPARK-4273:
-----------------------------------

             Summary: Providing ExternalSet to avoid OOM when count(distinct)
                 Key: SPARK-4273
                 URL: https://issues.apache.org/jira/browse/SPARK-4273
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
            Reporter: YanTang Zhai
            Priority: Minor


Some tasks may OOM when running count(distinct) if they need to process many records. CombineSetsAndCountFunction puts all records into an OpenHashSet; if it fetches many records, it may occupy a large amount of memory.
I think a data structure ExternalSet, analogous to ExternalAppendOnlyMap, could be provided to spill OpenHashSet data to disk when its size exceeds some threshold.
For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1 sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be illustrated as follows (a code sketch is given after the listing):
ohs1 [d, b, c, a] => [a, b, c, d] => file1
ohs2 [e, f, g, a] => [a, e, f, g] => file2
ohs3 [e, h, i, g] => [e, g, h, i] => file3
ohs4 [j, h, a] => [a, h, j] => sortedSet
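
A minimal sketch of the spill step in Scala, under the assumptions above; ExternalSet, spillThreshold, and the file handling are hypothetical names for illustration, and a mutable.HashSet stands in for OpenHashSet:

import java.io.{File, FileOutputStream, ObjectOutputStream}
import scala.collection.mutable

// Hypothetical sketch: accumulate elements in an in-memory set and, once it
// grows past a threshold, spill them to a file sorted by hashCode so the
// files can later be merge-sorted.
class ExternalSet[T <: AnyRef](spillThreshold: Int) {
  private var current = new mutable.HashSet[T]
  private val spillFiles = mutable.ArrayBuffer[File]()

  def insert(elem: T): Unit = {
    current += elem
    if (current.size >= spillThreshold) spill()
  }

  private def spill(): Unit = {
    val file = File.createTempFile("external-set-spill", ".bin")
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try {
      // Write elements in hashCode order, as in the listing above.
      current.toSeq.sortBy(_.hashCode).foreach(out.writeObject)
    } finally {
      out.close()
    }
    spillFiles += file
    current = new mutable.HashSet[T]  // the last, unspilled set is the "sortedSet"
  }
}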
On output, all keys with the same hashCode are put into one OpenHashSet, and then that OpenHashSet's iterator is traversed. The procedure could be illustrated as follows (a merge sketch follows the listing):
file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
file1 -> b -> ohsB; ohsB -> b;
file1 -> c -> ohsC; ohsC -> c;
file1 -> d -> ohsD; ohsD -> d;
file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e;
...
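
Under the same assumptions, a sketch of the merge step: each spill file (and the in-memory sortedSet) is read as an iterator in hashCode order, and the per-hashCode group set plays the role of ohsA, ohsB, etc. in the listing:

// Hypothetical merge sketch: repeatedly take the minimum hashCode across
// all sources, union every element with that hashCode into one small set
// (deduplicating elements spilled to different files), and emit them.
def mergeSorted[T](sources: Seq[BufferedIterator[T]]): Iterator[T] =
  new Iterator[T] {
    private var pending: Iterator[T] = Iterator.empty

    def hasNext: Boolean = pending.hasNext || sources.exists(_.hasNext)

    def next(): T = {
      if (!pending.hasNext) {
        // Smallest hashCode among the heads of all sources.
        val minHash = sources.filter(_.hasNext).map(_.head.hashCode).min
        val group = scala.collection.mutable.HashSet[T]()
        // Drain every element with that hashCode from every source.
        for (src <- sources) {
          while (src.hasNext && src.head.hashCode == minHash) group += src.next()
        }
        pending = group.iterator
      }
      pending.next()
    }
  }

Only one such group is in memory at a time, which is what bounds the memory use.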

I think using the ExternalSet could avoid OOM when count(distinct). Comments are welcome.



