Posted to dev@hive.apache.org by "Navis (JIRA)" <ji...@apache.org> on 2014/05/12 09:14:15 UTC
[jira] [Updated] (HIVE-4867) Deduplicate columns appearing in both
the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Navis updated HIVE-4867:
------------------------
Status: Patch Available (was: Open)
> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---------------------------------------------------------------------------------------
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
> Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Yin Huai
> Attachments: HIVE-4867.1.patch.txt
>
>
> A ReduceSinkOperator emits data as keys and values. Right now, a column may appear in both the key list and the value list, which results in unnecessary shuffle overhead.
> Example:
> We have a query shown below ...
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {code}
> The plan is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The column 'ss_ticket_number' is in both the key list and the value list of the ReduceSinkOperator. The type of ss_ticket_number is int. In this case, BinarySortableSerDe adds 1 extra byte for every int in the key, and LazyBinarySerDe adds further overhead when recording the length of an int. As a rough estimate, the Map phase emits about 10 bytes for every int.
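
The idea and the arithmetic in the description can be sketched as follows. This is only an illustration, not Hive code and not the attached patch: the function names are hypothetical, the row count is made up, and the 5-byte value size is an assumption chosen so that key plus value matches the rough 10-byte figure above. Only the 1 extra byte per int key comes from the description. Per-record framing overhead of the shuffle itself is ignored.

```python
# Sketch: drop value columns that already appear in the key list, then
# estimate per-row shuffle bytes for a single int column before and after.

INT_BYTES = 4            # raw width of an int
KEY_OVERHEAD_BYTES = 1   # extra byte per int key (figure taken from the issue text)
VALUE_BYTES_ASSUMED = 5  # ASSUMPTION: picked so key + value is about 10 bytes


def dedupe_value_columns(key_cols, value_cols):
    """Return the value columns that are not already shuffled as keys.

    Order of the remaining value columns is preserved.
    """
    keys = set(key_cols)
    return [col for col in value_cols if col not in keys]


def bytes_per_row(key_cols, value_cols):
    """Rough bytes emitted per row, treating every column as an int."""
    key_bytes = len(key_cols) * (INT_BYTES + KEY_OVERHEAD_BYTES)
    value_bytes = len(value_cols) * VALUE_BYTES_ASSUMED
    return key_bytes + value_bytes


if __name__ == "__main__":
    # Key and value lists as they appear in the plan above.
    key_cols = ["_col0"]
    value_cols = ["_col0"]
    rows = 1_000_000  # hypothetical row count

    slim_values = dedupe_value_columns(key_cols, value_cols)
    before = rows * bytes_per_row(key_cols, value_cols)
    after = rows * bytes_per_row(key_cols, slim_values)
    print("value columns after dedup:", slim_values)
    print("estimated bytes before:", before)
    print("estimated bytes after: ", after)
```

Under these assumptions, removing the duplicate from the value list roughly halves the data emitted for this query. The reduce side would then have to read the column back from the key, which this sketch does not model.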
--
This message was sent by Atlassian JIRA
(v6.2#6252)