You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Stephen Boesch <ja...@gmail.com> on 2013/09/20 23:46:30 UTC
Loading data into partition taking seven times total of (map+reduce)
on highly skewed data
We have a small (3GB /280M rows) table with 435 partitions that is highly
skewed: one partition has nearly 200M, two others have nearly 40M apiece,
then the remaining 432 have all together less than 1% of total table size.
So .. the skew is something to be addressed. However - even give that -
why would the following occur?
Table Structure:
# Partition Information
# col_name data_type comment
derived_create_dt string None
# Detailed Table Information
..
Protect Mode: None
Retention: 0
..
Table Type: MANAGED_TABLE
Table Parameters:
SORTBUCKETCOLSPREFIX TRUE
transient_lastDdlTime 1379678551
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat
Compressed: No
Num Buckets: 64
Bucket Columns: [station_id]
Sort Columns: [Order(col:station_id, order:1)]
Storage Desc Params:
serialization.format 1
HIGHLY SKEWED data: although
This particular load:
300M rows
4GB
435 partitions
Over 99% of data in just 3 out of the 435 partitons
2013-09-18 26733990
2013-09-19 191634067
2013-09-20 63790065
Map takes 10 min
Reduce 13 mins
Loading into partitions takes 3 hours 27 minutes
Re: Loading data into partition taking seven times total of
(map+reduce) on highly skewed data
Posted by Stephen Boesch <ja...@gmail.com>.
Another detail: ~400 mappers 64 reducers
2013/9/20 Stephen Boesch <ja...@gmail.com>
>
> We have a small (3GB /280M rows) table with 435 partitions that is highly
> skewed: one partition has nearly 200M, two others have nearly 40M apiece,
> then the remaining 432 have all together less than 1% of total table size.
>
> So .. the skew is something to be addressed. However - even give that -
> why would the following occur?
>
>
> Table Structure:
>
> # Partition Information
> # col_name data_type comment
> derived_create_dt string None
>
> # Detailed Table Information
> ..
> Protect Mode: None
> Retention: 0
> ..
> Table Type: MANAGED_TABLE
> Table Parameters:
> SORTBUCKETCOLSPREFIX TRUE
> transient_lastDdlTime 1379678551
>
> # Storage Information
> SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
> InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat
> OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat
> Compressed: No
> Num Buckets: 64
> Bucket Columns: [station_id]
> Sort Columns: [Order(col:station_id, order:1)]
> Storage Desc Params:
> serialization.format 1
>
> HIGHLY SKEWED data: although
> This particular load:
> 300M rows
> 4GB
> 435 partitions
> Over 99% of data in just 3 out of the 435 partitons
> 2013-09-18 26733990
> 2013-09-19 191634067
> 2013-09-20 63790065
>
>
>
> Map takes 10 min
> Reduce 13 mins
> Loading into partitions takes 3 hours 27 minutes
>
>
>