You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Ajo Fod <aj...@gmail.com> on 2011/08/02 18:35:21 UTC
Question about sorted tables
Hello Hive Gurus,
I am not sure if my system is using the sorting feature.
In summary:
- I expected to save time on the sorting step because I was using
pre-sorted data, but the query plan seem to indicate an intermediate
sorting step.
########################################
=== The Setup =======
########################################
========= I have set the following flags:
set hive.enforce.bucketing = true;
set mapred.reduce.tasks=8;
set mapred.map.tasks=8;
==== here I create a table to hold a temporary copy of data on disk ========
CREATE TABLE trades
(symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
===== here I copy the data on disk into the table ===========
===== BTW, the data here is clustered by symbol and sorted by time =======
===== I can't seem to get Hive to use this concept ... i.e avoid
sorting again =========
LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
INTO TABLE trades
partition (dt='2010-05-07');
========= I use the following final table to enforce bucketing ===========
========= and to impose sort order ===========
CREATE TABLE alltrades
(symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
========== data is loaded from a hive table ==========
insert overwrite table alltrades
select symbol, exchange, price, volume, cond, bid, ask, time
from trades
distribute by symbol sort by symbol, time;
============
It is disappointing to see that any query on alltrades that requires
sorted symbol,time does the sorting all over again ... is there a way
around this?
Also, is there a way to make this whole process work in 1 query step
instead of 2 ?
===========
########################################
=== Why SORTING seems to not work =======
########################################
Note that the table was constructed and populated with the sort by clause.
I'm afraid that dropping these would lead a future reducer to behave
as if no sorting was required.
========== Here is the plan for a query that in my opinion should not
involve sorting ... but actually does.========
hive> explain select symbol, time, price from alltrades sort by symbol, time;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
alltrades
TableScan
alias: alltrades
Select Operator
expressions:
expr: symbol
type: string
expr: time
type: string
expr: price
type: float
outputColumnNames: _col0, _col1, _col2
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
sort order: ++
tag: -1
value expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: float
Reduce Operator Tree:
Extract
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Re: Question about sorted tables
Posted by Ajo Fod <aj...@gmail.com>.
Hello,
Is this not the forum for this type of question? Is there another
forum someone recommends?
Thanks,
Ajo.
On Tue, Aug 2, 2011 at 9:35 AM, Ajo Fod <aj...@gmail.com> wrote:
> Hello Hive Gurus,
>
> I am not sure if my system is using the sorting feature.
>
> In summary:
> - I expected to save time on the sorting step because I was using
> pre-sorted data, but the query plan seem to indicate an intermediate
> sorting step.
>
> ########################################
> === The Setup =======
> ########################################
>
> ========= I have set the following flags:
> set hive.enforce.bucketing = true;
> set mapred.reduce.tasks=8;
> set mapred.map.tasks=8;
>
> ==== here I create a table to hold a temporary copy of data on disk ========
> CREATE TABLE trades
> (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
> INT, bid FLOAT, ask FLOAT, time STRING)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
>
> ===== here I copy the data on disk into the table ===========
> ===== BTW, the data here is clustered by symbol and sorted by time =======
> ===== I can't seem to get Hive to use this concept ... i.e avoid
> sorting again =========
> LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
> INTO TABLE trades
> partition (dt='2010-05-07');
>
> ========= I use the following final table to enforce bucketing ===========
> ========= and to impose sort order ===========
> CREATE TABLE alltrades
> (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
> INT, bid FLOAT, ask FLOAT, time STRING)
> CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
>
> ========== data is loaded from a hive table ==========
> insert overwrite table alltrades
> select symbol, exchange, price, volume, cond, bid, ask, time
> from trades
> distribute by symbol sort by symbol, time;
>
> ============
> It is disappointing to see that any query on alltrades that requires
> sorted symbol,time does the sorting all over again ... is there a way
> around this?
> Also, is there a way to make this whole process work in 1 query step
> instead of 2 ?
> ===========
>
> ########################################
> === Why SORTING seems to not work =======
> ########################################
>
> Note that the table was constructed and populated with the sort by clause.
> I'm afraid that dropping these would lead a future reducer to behave
> as if no sorting was required.
>
> ========== Here is the plan for a query that in my opinion should not
> involve sorting ... but actually does.========
> hive> explain select symbol, time, price from alltrades sort by symbol, time;
>
> OK
> ABSTRACT SYNTAX TREE:
> (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
> (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
> (TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
> time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
> (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
> (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))
>
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
>
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> alltrades
> TableScan
> alias: alltrades
> Select Operator
> expressions:
> expr: symbol
> type: string
> expr: time
> type: string
> expr: price
> type: float
> outputColumnNames: _col0, _col1, _col2
> Reduce Output Operator
> key expressions:
> expr: _col0
> type: string
> expr: _col1
> type: string
> sort order: ++
> tag: -1
> value expressions:
> expr: _col0
> type: string
> expr: _col1
> type: string
> expr: _col2
> type: float
> Reduce Operator Tree:
> Extract
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
> Stage: Stage-0
> Fetch Operator
> limit: -1
>