You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Ajo Fod <aj...@gmail.com> on 2011/08/02 18:35:21 UTC

Question about sorted tables

Hello Hive Gurus,

I am not sure if my system is using the sorting feature.

In summary:
- I expected to save time on the sorting step because I was using
pre-sorted data, but the query plan seem to indicate an intermediate
sorting step.

########################################
=== The  Setup  =======
########################################

========= I have set the following flags:
set hive.enforce.bucketing = true;
set mapred.reduce.tasks=8;
set mapred.map.tasks=8;

==== here I create a table to hold a temporary copy of data on disk ========
CREATE TABLE trades
       (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE;

===== here I copy the data on disk into the table ===========
===== BTW, the data here is clustered by symbol and sorted by time =======
===== I can't seem to get Hive to use this concept ... i.e avoid
sorting again =========
LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
INTO TABLE trades
partition (dt='2010-05-07');

========= I use the following final table to enforce bucketing ===========
========= and to impose sort order ===========
CREATE TABLE alltrades
       (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE;

========== data is loaded from a hive table ==========
insert overwrite table alltrades
select symbol, exchange, price, volume, cond, bid, ask, time
from trades
distribute by symbol sort by symbol, time;

============
It is disappointing to see that any query on alltrades that requires
sorted symbol,time does the sorting all over again ... is there a way
around this?
Also, is there a way to make this whole process work in 1 query step
instead of 2 ?
===========

########################################
=== Why SORTING seems to not work =======
########################################

Note that the table was constructed and populated with the sort by clause.
I'm afraid that dropping these would lead a future reducer to behave
as if no sorting was required.

========== Here is the plan for a query that in my opinion should not
involve sorting ... but actually does.========
hive> explain select symbol, time, price from alltrades sort by symbol, time;

OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        alltrades
          TableScan
            alias: alltrades
            Select Operator
              expressions:
                    expr: symbol
                    type: string
                    expr: time
                    type: string
                    expr: price
                    type: float
              outputColumnNames: _col0, _col1, _col2
              Reduce Output Operator
                key expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: string
                sort order: ++
                tag: -1
                value expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: string
                      expr: _col2
                      type: float
      Reduce Operator Tree:
        Extract
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

Re: Question about sorted tables

Posted by Ajo Fod <aj...@gmail.com>.

Hello,

Is this not the forum for this type of question? Is there another
forum someone recommends?

Thanks,
Ajo.

On Tue, Aug 2, 2011 at 9:35 AM, Ajo Fod <aj...@gmail.com> wrote:
> Hello Hive Gurus,
>
> I am not sure if my system is using the sorting feature.
>
> In summary:
> - I expected to save time on the sorting step because I was using
> pre-sorted data, but the query plan seem to indicate an intermediate
> sorting step.
>
> ########################################
> === The  Setup  =======
> ########################################
>
> ========= I have set the following flags:
> set hive.enforce.bucketing = true;
> set mapred.reduce.tasks=8;
> set mapred.map.tasks=8;
>
> ==== here I create a table to hold a temporary copy of data on disk ========
> CREATE TABLE trades
>       (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
> INT, bid FLOAT, ask FLOAT, time STRING)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>  STORED AS TEXTFILE;
>
> ===== here I copy the data on disk into the table ===========
> ===== BTW, the data here is clustered by symbol and sorted by time =======
> ===== I can't seem to get Hive to use this concept ... i.e avoid
> sorting again =========
> LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
> INTO TABLE trades
> partition (dt='2010-05-07');
>
> ========= I use the following final table to enforce bucketing ===========
> ========= and to impose sort order ===========
> CREATE TABLE alltrades
>       (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
> INT, bid FLOAT, ask FLOAT, time STRING)
> CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>  STORED AS TEXTFILE;
>
> ========== data is loaded from a hive table ==========
> insert overwrite table alltrades
> select symbol, exchange, price, volume, cond, bid, ask, time
> from trades
> distribute by symbol sort by symbol, time;
>
> ============
> It is disappointing to see that any query on alltrades that requires
> sorted symbol,time does the sorting all over again ... is there a way
> around this?
> Also, is there a way to make this whole process work in 1 query step
> instead of 2 ?
> ===========
>
> ########################################
> === Why SORTING seems to not work =======
> ########################################
>
> Note that the table was constructed and populated with the sort by clause.
> I'm afraid that dropping these would lead a future reducer to behave
> as if no sorting was required.
>
> ========== Here is the plan for a query that in my opinion should not
> involve sorting ... but actually does.========
> hive> explain select symbol, time, price from alltrades sort by symbol, time;
>
> OK
> ABSTRACT SYNTAX TREE:
>  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
> (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
> (TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
> time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
> (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
> (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))
>
> STAGE DEPENDENCIES:
>  Stage-1 is a root stage
>  Stage-0 is a root stage
>
> STAGE PLANS:
>  Stage: Stage-1
>    Map Reduce
>      Alias -> Map Operator Tree:
>        alltrades
>          TableScan
>            alias: alltrades
>            Select Operator
>              expressions:
>                    expr: symbol
>                    type: string
>                    expr: time
>                    type: string
>                    expr: price
>                    type: float
>              outputColumnNames: _col0, _col1, _col2
>              Reduce Output Operator
>                key expressions:
>                      expr: _col0
>                      type: string
>                      expr: _col1
>                      type: string
>                sort order: ++
>                tag: -1
>                value expressions:
>                      expr: _col0
>                      type: string
>                      expr: _col1
>                      type: string
>                      expr: _col2
>                      type: float
>      Reduce Operator Tree:
>        Extract
>          File Output Operator
>            compressed: false
>            GlobalTableId: 0
>            table:
>                input format: org.apache.hadoop.mapred.TextInputFormat
>                output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>  Stage: Stage-0
>    Fetch Operator
>      limit: -1
>