Posted to user@hive.apache.org by Avrilia Floratou <av...@gmail.com> on 2013/12/30 05:36:09 UTC

ORC file tuning

Hi all,

I'm using Hive 0.12 and running some experiments with ORC files. The
HDFS block size is 128 MB and I was wondering what the best stripe size
to use is. The default one (250 MB) is larger than the block size. Is each
stripe splittable, or will each map task have to access data over the
network in this case? I also tried setting the stripe size to 128 MB (same
as the block size) via tblproperties in the CREATE TABLE statement, but
noticed that for a file of about 544 GB, 2026 map tasks are launched, which
means each split corresponds to about 250 MB. Is there anything else I
should do to align the block size, stripe size, and split size in the ORC
file?
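
For reference, a per-table stripe size and block padding can be set at
table-creation time through tblproperties; a minimal sketch (the table and
column names here are hypothetical, the property names are the standard ORC
ones):

```sql
-- Hypothetical table; "orc.stripe.size" is given in bytes.
CREATE TABLE events (
  id BIGINT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES (
  "orc.stripe.size" = "134217728",   -- 128 MB stripes
  "orc.block.padding" = "true"       -- pad so stripes don't straddle blocks
);
```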

Thanks,
Avrilia

Re: ORC file tuning

Posted by Yin Huai <hu...@gmail.com>.
Hi Avrilia,

In org.apache.hadoop.hive.ql.io.orc.WriterImpl, the block size is
determined as Math.min(1.5GB, 2 * stripeSize). Also, you can use the
"orc.block.padding" table property to control whether the writer pads
HDFS blocks to prevent stripes from straddling blocks. The default
value of this flag is true.
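
The formula above can be sketched as plain Java to see the block sizes the
writer would request (this is only an illustration of the quoted
min(1.5GB, 2 * stripeSize) rule, not the actual Hive code):

```java
// Sketch of the block-size rule quoted above: blockSize = min(1.5 GB, 2 * stripeSize).
public class OrcBlockSizeSketch {

    // 1.5 GB cap, as stated in the formula above.
    static final long MAX_BLOCK = (long) (1.5 * 1024 * 1024 * 1024);

    static long blockSizeFor(long stripeSize) {
        return Math.min(MAX_BLOCK, 2 * stripeSize);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Default ~250 MB stripe: the writer asks for ~500 MB blocks.
        System.out.println(blockSizeFor(250 * mb) / mb); // 500
        // 128 MB stripe: 256 MB blocks -- still larger than a 128 MB
        // dfs.blocksize, consistent with the ~250 MB splits observed.
        System.out.println(blockSizeFor(128 * mb) / mb); // 256
    }
}
```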

Thanks,

Yin
