Posted to user@hive.apache.org by Nitin Kumar <nk...@gmail.com> on 2016/04/21 07:56:58 UTC

Managing input split sizes in Hive running the Tez engine

Hi,

I want to gain a better understanding of how input splits are
calculated in the Tez engine.

I am aware that the *hive.input.format* property can be set to either
*HiveInputFormat* (the default) or *CombineHiveInputFormat* (generally
recommended when there are many files much smaller than the HDFS block size).

I was hoping someone could walk me through the differences in how
*HiveInputFormat* and *CombineHiveInputFormat* calculate split sizes as
data file sizes vary from small (smaller than a block) to large (spanning
multiple blocks).

I want to dictate the number of mapper tasks that are spawned for scanning
a table. For the MR engine this can be controlled by setting the
*mapred.min.split.size* and *mapred.max.split.size* properties, as shown
below. I would like to know if there are similar configurations for the
Tez engine.
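On the MR engine, for example, I would bound the splits with something like
the following before running the query (the byte values here are only
illustrative, 64 MB min and 256 MB max):

  -- MR engine: illustrative split-size bounds, in bytes
  SET mapred.min.split.size=67108864;    -- 64 MB
  SET mapred.max.split.size=268435456;   -- 256 MB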

Also, the properties *tez.grouping.max-size*, *tez.grouping.min-size*
and *tez.grouping.split-waves* have been set to 1 GB, 16 MB and
1.7 respectively. However, I observed that the resulting input splits do not
adhere to these properties.
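For reference, I set them in the session roughly as follows (the byte values
are my conversions of 1 GB and 16 MB):

  -- Tez grouping settings used
  SET tez.grouping.min-size=16777216;     -- 16 MB
  SET tez.grouping.max-size=1073741824;   -- 1 GB
  SET tez.grouping.split-waves=1.7;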

I had two files of 3 MB each for a table. Since together they total only
6 MB, well below the 16 MB minimum grouping size, I expected them to be
grouped into a single split so that only 1 mapper task would be spawned,
but 2 mapper tasks were spawned instead.

Are there other properties in Hive/Tez that need to be set to enable input
split grouping?

I would greatly appreciate your input.

Thanks and regards,
Nitin