Posted to user@hive.apache.org by John Omernik <jo...@omernik.com> on 2014/02/03 02:25:53 UTC

What are all the factors that go into the number of mappers - ORC

I have two clusters, both small dev clusters, and I loaded the same dataset
into both of them. The data sizes on disk are within 2000 bytes of each
other. Both are ORC; one is Hive 11 and one is Hive 12. One is allocating
about 8 more mappers to the exact same query. I am just curious what settings
would change that. I checked through all my settings, but can't see what
would cause the discrepancy. Is this an ORC v11 vs v12 thing?

I'd be curious to hear the thoughts of the group.

Re: What are all the factors that go into the number of mappers - ORC

Posted by John Omernik <jo...@omernik.com>.
No, the size is closer to 10GB; the difference between the tables is only
around 2000 bytes. I will try to get exact numbers for you soon. I am
traveling right now, but I'll get you better data to work with shortly.

Thanks!




Re: What are all the factors that go into the number of mappers - ORC

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi John

The number of mappers is equal to the number of splits generated. The following factors go into split generation:
1) HDFS block size
2) Max split size

A split is cut when either
1) the cumulative size of all adjacent stripes is greater than the HDFS block size, or
2) the cumulative size of all adjacent stripes is greater than the max split size.
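As a rough illustration of the rule above, here is a minimal Python sketch that counts splits for a list of stripe sizes. It is a simplified model of the description in this message, not Hive's actual split-generation code; the function name count_splits and the stripe sizes used below are hypothetical.

```python
MB = 1024 * 1024


def count_splits(stripe_sizes, block_size, max_split_size):
    """Count splits under the simplified rule described in this thread.

    stripe_sizes: sizes in bytes of the adjacent stripes of one ORC file.
    A split is cut once the cumulative stripe size is greater than the
    HDFS block size or greater than the max split size.
    """
    splits = 0
    current = 0  # cumulative bytes of the stripes in the open split
    for size in stripe_sizes:
        current += size
        if current > block_size or current > max_split_size:
            splits += 1
            current = 0
    # Any stripes left over form one final split.
    if current > 0:
        splits += 1
    return splits


# Hypothetical file: 40 stripes of 256 MB each (about 10 GB).
stripes = [256 * MB] * 40
print(count_splits(stripes, block_size=512 * MB, max_split_size=256 * MB))
print(count_splits(stripes, block_size=512 * MB, max_split_size=1024 * MB))
```

Under this model, the same stripes produce a different number of splits (and so mappers) as soon as either limit differs between two clusters.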

The HDFS block size for ORC files will be min(1.5GB, 2*stripe_size) in the current version of Hive (and probably Hive 0.12 too). In older versions, HDFS block size = min(2GB, 2*stripe_size).
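The two formulas can be compared with a few lines of Python. This is only arithmetic on the numbers quoted above; the function name orc_block_size and the example stripe sizes are assumptions for illustration, not values taken from either cluster.

```python
MB = 1024 * 1024
GB = 1024 * MB

NEW_CAP = 3 * GB // 2  # 1.5 GB cap, as described for the current version
OLD_CAP = 2 * GB       # 2 GB cap, as described for older versions


def orc_block_size(stripe_size, cap):
    """Block size per the formula above: min(cap, 2 * stripe_size)."""
    return min(cap, 2 * stripe_size)


# Hypothetical stripe sizes, chosen only to show where the caps matter.
for stripe_size in (256 * MB, 1 * GB):
    new = orc_block_size(stripe_size, NEW_CAP)
    old = orc_block_size(stripe_size, OLD_CAP)
    print(stripe_size // MB, "MB stripe:", new // MB, "MB vs", old // MB, "MB")
```

With these formulas the two versions only disagree once 2*stripe_size is greater than 1.5GB, so for smaller stripes the cap alone would not explain a difference in mapper count.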

The other important thing to note is that ORC splits are generated only when HiveInputFormat is used. By default, Hive uses CombineHiveInputFormat, which uses a different strategy to generate splits: many small files are combined together to form one large logical split.

In any case, for the size you mentioned (2000 bytes) there should be only one mapper. Can you provide the values of the following configs so that we can understand it better?

1) hive.input.format
2) hive.min.split.size
3) hive.max.split.size
4) total size on disk for the table

Thanks
Prasanth Jayachandran
