Posted to user@hive.apache.org by Stephen Boesch <ja...@gmail.com> on 2013/09/20 23:46:30 UTC

Loading data into partition taking seven times total of (map+reduce) on highly skewed data

We have a small (3GB /280M rows) table with 435 partitions that is highly
skewed:  one partition has nearly 200M, two others have nearly 40M apiece,
then the remaining 432 have all together less than 1% of total table size.

So the skew is something to be addressed. However, even given that,
why would the following occur?


Table Structure:

# Partition Information
# col_name              data_type    comment
derived_create_dt       string       None

# Detailed Table Information
..
Protect Mode:           None
Retention:              0
..
Table Type:             MANAGED_TABLE
Table Parameters:
    SORTBUCKETCOLSPREFIX    TRUE
    transient_lastDdlTime   1379678551

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed:             No
Num Buckets:            64
Bucket Columns:         [station_id]
Sort Columns:           [Order(col:station_id, order:1)]
Storage Desc Params:
    serialization.format    1

HIGHLY SKEWED data. This particular load:
    300M rows
    4GB
    435 partitions
    Over 99% of data in just 3 out of the 435 partitions:
        2013-09-18   26733990
        2013-09-19  191634067
        2013-09-20   63790065



Map takes 10 minutes
Reduce takes 13 minutes
Loading into partitions takes 3 hours 27 minutes
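
As a quick sanity check on the figures above, here is a minimal Python sketch (not part of the original job; the variable names are purely illustrative). It compares the load time against the combined map and reduce time and sums the three large partitions, using only the numbers quoted in this post. With these timings the load phase comes out at roughly nine times map+reduce.

```python
# Timings as reported in the post, in minutes.
map_min = 10
reduce_min = 13
load_min = 3 * 60 + 27  # 3 hours 27 minutes

mapreduce_min = map_min + reduce_min
ratio = load_min / mapreduce_min

# Row counts of the three large partitions, as listed in the post.
big_partitions = {
    "2013-09-18": 26733990,
    "2013-09-19": 191634067,
    "2013-09-20": 63790065,
}
big_total = sum(big_partitions.values())
largest_share = max(big_partitions.values()) / big_total

print(f"map+reduce: {mapreduce_min} min, load: {load_min} min, ratio: {ratio:.1f}x")
print(f"rows in the three large partitions: {big_total}")
print(f"largest partition's share of those rows: {largest_share:.1%}")
```

The share is computed only across the three listed partitions, since the post gives the total row count only approximately.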

Re: Loading data into partition taking seven times total of (map+reduce) on highly skewed data

Posted by Stephen Boesch <ja...@gmail.com>.

Another detail: ~400 mappers, 64 reducers.

