Posted to user@hive.apache.org by Avrilia Floratou <av...@gmail.com> on 2013/12/30 05:36:09 UTC
ORC file tuning
Hi all,
I'm using Hive 0.12 and running some experiments with the ORC file format.
The HDFS block size is 128 MB, and I was wondering what the best stripe
size to use is. The default (250 MB) is larger than the block size. Is each
stripe splittable, or will each map task have to read part of its data over
the network? I also tried setting the stripe size to 128 MB (the same as
the block size) via TBLPROPERTIES in the CREATE TABLE statement, but
noticed that for a file of about 544 GB, 2026 map tasks are launched, which
means each split corresponds to about 250 MB. Is there anything else I
should do to align the block size, stripe size, and split size for the ORC
file?
Thanks,
Avrilia
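For reference, the stripe-size and padding settings discussed in this thread are set per table via TBLPROPERTIES. A sketch (table and column names are made up; sizes are in bytes):

```sql
CREATE TABLE orc_demo (id INT, name STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.stripe.size"   = "134217728",  -- 128 MB stripes, matching the HDFS block size
  "orc.block.padding" = "true"        -- pad blocks so stripes don't straddle them
);
```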
Re: ORC file tuning
Posted by Yin Huai <hu...@gmail.com>.
Hi Avrilia,
In org.apache.hadoop.hive.ql.io.orc.WriterImpl, the block size is
determined by Math.min(1.5GB, 2 * stripeSize). Also, you can use the
"orc.block.padding" table property to control whether the writer pads HDFS
blocks so that stripes do not straddle block boundaries. The default value
of this flag is true.
Thanks,
Yin
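The block-size rule Yin describes can be sketched in a few lines (illustrative only; the helper name is made up, not an actual WriterImpl method):

```python
GB = 1024 ** 3
MB = 1024 ** 2

def orc_block_size(stripe_size: int) -> int:
    """Sketch of the rule above: block size = min(1.5 GB, 2 * stripe size)."""
    return min(int(1.5 * GB), 2 * stripe_size)

# With the default 250 MB stripes, blocks come out to 500 MB;
# with 128 MB stripes, they come out to 256 MB.
print(orc_block_size(250 * MB) // MB)  # 500
print(orc_block_size(128 * MB) // MB)  # 256
```

Note that with 128 MB stripes this yields 256 MB blocks, which is consistent with the ~250 MB splits observed in the original message.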