Posted to dev@hive.apache.org by "Venki Korukanti (JIRA)" <ji...@apache.org> on 2014/08/05 23:19:12 UTC

[jira] [Updated] (HIVE-7492) Enhance SparkCollector

     [ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venki Korukanti updated HIVE-7492:
----------------------------------

    Attachment: HIVE-7492-1-spark.patch

Attaching a patch.

Instead of processing all input records at once and returning a fully materialized Iterable, the patch evaluates input records lazily, as output is requested by the downstream consumer of the returned Iterable. The PairFlat(Map/Reduce)Function implementation returns a custom Iterable, which in turn returns a custom Iterator. This iterator holds the initialized ExecMapper/ExecReducer and the input-record Iterator. When hasNext() is called on the custom Iterator, it reads record(s) from the input Iterator and applies the ExecMapper/ExecReducer function. The output records produced by processing one input record are stored in a HiveKVResultCache, which supports spilling to disk when the number of cached output records exceeds a threshold (currently 512). The next() method of the custom Iterator then serves results from the HiveKVResultCache.
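To make the control flow concrete, here is a minimal, self-contained sketch of the lazy-Iterable pattern described above. This is not Hive's actual code: the class name LazyResultIterable is made up, the process function stands in for ExecMapper/ExecReducer, and the in-memory queue stands in for HiveKVResultCache (no spilling here).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.function.Function;

// Sketch only: a lazy Iterable that pulls input records on demand and
// buffers the outputs of processing each record. In the real patch the
// "process" step is ExecMapper/ExecReducer and the buffer is
// HiveKVResultCache, which can spill to disk past ~512 entries.
public class LazyResultIterable<I, O> implements Iterable<O> {
    private final Iterator<I> input;
    private final Function<I, List<O>> process; // stands in for ExecMapper/ExecReducer

    public LazyResultIterable(Iterator<I> input, Function<I, List<O>> process) {
        this.input = input;
        this.process = process;
    }

    @Override
    public Iterator<O> iterator() {
        return new Iterator<O>() {
            // stands in for HiveKVResultCache (purely in-memory here)
            private final Queue<O> cache = new ArrayDeque<>();

            @Override
            public boolean hasNext() {
                // Lazily consume input records until there is output to serve.
                while (cache.isEmpty() && input.hasNext()) {
                    cache.addAll(process.apply(input.next()));
                }
                return !cache.isEmpty();
            }

            @Override
            public O next() {
                if (!hasNext()) throw new NoSuchElementException();
                return cache.poll();
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> rows = Arrays.asList("a b", "c").iterator();
        Iterable<String> out =
            new LazyResultIterable<>(rows, s -> Arrays.asList(s.split(" ")));
        StringBuilder sb = new StringBuilder();
        for (String t : out) sb.append(t);
        System.out.println(sb); // prints "abc"
    }
}
```

The key point is that no input record is touched until the downstream consumer calls hasNext()/next(), so memory usage tracks the output of one record at a time rather than the whole partition.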

> Enhance SparkCollector
> ----------------------
>
>                 Key: HIVE-7492
>                 URL: https://issues.apache.org/jira/browse/HIVE-7492
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Venki Korukanti
>         Attachments: HIVE-7492-1-spark.patch
>
>
> SparkCollector is used to collect the rows generated by HiveMapFunction or HiveReduceFunction. It is currently backed by an ArrayList, and thus has unbounded memory usage. Ideally, the collector should have bounded memory usage, and be able to spill to disk when its quota is reached.
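The bounded-memory, spill-on-quota behavior the issue asks for can be illustrated with a toy collector. This is a hypothetical sketch, not HiveKVResultCache or the patched SparkCollector: the class name, the string-row payload, and the line-per-row spill format are all invented for illustration.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy sketch of a collector with a memory quota: rows are buffered in
// memory up to the quota, and any further rows are spilled to a temp file.
public class SpillingCollector {
    private final int memoryQuota;                  // max rows kept in memory
    private final List<String> buffer = new ArrayList<>();
    private File spillFile;
    private BufferedWriter spillWriter;
    private long spilledCount = 0;

    public SpillingCollector(int memoryQuota) {
        this.memoryQuota = memoryQuota;
    }

    // Buffer the row in memory until the quota is hit, then spill to disk.
    public void collect(String row) throws IOException {
        if (buffer.size() < memoryQuota) {
            buffer.add(row);
            return;
        }
        if (spillWriter == null) {
            spillFile = File.createTempFile("spark-collector", ".spill");
            spillFile.deleteOnExit();
            spillWriter = new BufferedWriter(new FileWriter(spillFile));
        }
        spillWriter.write(row);
        spillWriter.newLine();
        spilledCount++;
    }

    public void close() throws IOException {
        if (spillWriter != null) spillWriter.close();
    }

    public int inMemory() { return buffer.size(); }
    public long spilled() { return spilledCount; }

    public static void main(String[] args) throws IOException {
        SpillingCollector c = new SpillingCollector(3);
        for (int i = 0; i < 5; i++) c.collect("row-" + i);
        c.close();
        System.out.println(c.inMemory() + " in memory, " + c.spilled() + " spilled");
        // prints "3 in memory, 2 spilled"
    }
}
```

With a quota of 3 and 5 collected rows, 3 rows stay in memory and 2 are spilled, so memory usage stays bounded regardless of how many rows one input record produces.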



--
This message was sent by Atlassian JIRA
(v6.2#6252)