Posted to user@hive.apache.org by Rob Verkuylen <ro...@verkuylen.net> on 2019/07/26 14:20:16 UTC

CombineHiveInputFormat not splitting as expected

Hi all,

We are compacting the Snappy-compressed sequence files in our input stream,
on which we run Hive queries, and are generally working towards files of
50+ GB.

We see some unexpected behaviour: even smaller files, for example one of
5.2 GB, get only a single mapper in a Hive query using the default
CombineHiveInputFormat. Switching to HiveInputFormat gives us 21 mappers
for the same query, which is what we would expect with a 256 MB block
size.
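
For reference, the 21 comes from simple arithmetic on the file size and the
block size. A minimal sketch, assuming one split per 256 MB block and binary
units (1 GB = 1024 MB); the function name expected_splits is made up for
illustration and is not part of Hive or Hadoop, and the sketch ignores any
slack Hadoop may allow on the last split:

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def expected_splits(file_size_bytes: int, split_size_bytes: int) -> int:
    """Number of splits if a file is cut into pieces of at most split_size_bytes."""
    return math.ceil(file_size_bytes / split_size_bytes)


if __name__ == "__main__":
    # 5.2 GB file, 256 MB blocks: 20.8 blocks, rounded up.
    print(expected_splits(int(5.2 * GB), 256 * MB))
```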

What would be the drawbacks of switching from CombineHiveInputFormat to
HiveInputFormat? I would imagine more splits in cases where we have many
small files on the same node, which in our case would not happen that
often, and we have enough resources to handle the potential extra mappers.
Is this thinking correct? Are there any other drawbacks?
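
To make the trade-off concrete, here is a rough sketch of the mapper counts
under the two strategies. It is a deliberately simplified model, not how Hive
or Hadoop actually compute splits: it assumes per-file splitting yields at
least one split per file, that combining packs all bytes into splits of at
most 256 MB, and it ignores node and rack locality. The function names are
hypothetical.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def per_file_splits(file_sizes, split_size):
    """Simplified model: every file is split on its own, at least one split each."""
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)


def combined_splits(file_sizes, max_split_size):
    """Simplified model: bytes from all files are packed into shared splits."""
    return max(1, math.ceil(sum(file_sizes) / max_split_size))


if __name__ == "__main__":
    many_small = [1 * MB] * 1000    # 1000 files of 1 MB each
    one_large = [int(5.2 * GB)]     # the single 5.2 GB file from above

    for name, files in (("many small", many_small), ("one large", one_large)):
        print(name,
              per_file_splits(files, 256 * MB),
              combined_splits(files, 256 * MB))
```

Under these assumptions the two strategies only differ for the many-small-files
case (1000 mappers versus 4). For the single 5.2 GB file the model gives 21
mappers either way, so it does not account for the single mapper we observe.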

Best regards
Rob