Posted to dev@hive.apache.org by "Hengyu Dai (JIRA)" <ji...@apache.org> on 2017/12/06 10:42:00 UTC
[jira] [Created] (HIVE-18234) Hive MergeFileTask doesn't work correctly
Hengyu Dai created HIVE-18234:
---------------------------------
Summary: Hive MergeFileTask doesn't work correctly
Key: HIVE-18234
URL: https://issues.apache.org/jira/browse/HIVE-18234
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 2.1.1
Reporter: Hengyu Dai
For MergeFileTask, Hive reads the properties hive.merge.mapfiles, hive.merge.mapredfiles, hive.merge.size.per.task and hive.merge.smallfiles.avgsize to decide whether to generate a MergeFileTask to merge small files. If a merge is needed, Hive generates a MergeFileTask/MapWork to merge the files, and the properties are finally stored in MapWork#maxSplitSize, MapWork#minSplitSize, MapWork#minSplitSizePerNode and MapWork#minSplitSizePerRack.
But Hive doesn't use these settings when it submits the map task to Hadoop, i.e. the corresponding Hadoop settings "mapred.max.split.size", "mapred.min.split.size.per.node" and "mapred.min.split.size.per.rack" are not set from these Hive settings. So those Hive settings do not take effect for MergeFileTask.
Steps to reproduce:
This SQL will still produce many small files (less than 20MB):
{code:sql}
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=1000000000;
insert overwrite table foo partition(dt='20171203')
select * from bar;
{code}
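For context, the Hive documentation describes hive.merge.smallfiles.avgsize as a threshold on the average output file size: when the average falls below it, an extra merge job is started. The sketch below restates only that comparison, using the numbers from the reproduction above. The class and method names are hypothetical and this is not Hive's actual code, which applies further conditions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Hive's implementation: it only restates the
// documented rule that a merge is triggered when the average output file
// size is below hive.merge.smallfiles.avgsize.
final class MergeDecisionSketch {

    /** Returns true when the average file size is below the threshold. */
    static boolean shouldMerge(List<Long> fileSizesBytes, long smallFilesAvgSize) {
        if (fileSizesBytes.isEmpty()) {
            return false; // nothing to merge
        }
        long total = 0L;
        for (long size : fileSizesBytes) {
            total += size;
        }
        long average = total / fileSizesBytes.size();
        return average < smallFilesAvgSize;
    }

    public static void main(String[] args) {
        long threshold = 500_000_000L; // hive.merge.smallfiles.avgsize from the repro
        // Illustrative output file sizes, each below 20MB as in the report.
        List<Long> outputs = Arrays.asList(20_000_000L, 18_000_000L, 15_000_000L);
        System.out.println(shouldMerge(outputs, threshold));
    }
}
```

With files of this size the average is far below the 500000000 threshold, so a merge is expected to be triggered. The problem reported here is that the merge job then does not honour the configured split sizes.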
To fix this problem, I think we should pass these properties to Hadoop in MergeFileTask.
The following code works for me:
{code:java}
// in MergeFileTask#execute()
job.setInputFormat(work.getInputformatClass());
job.setOutputFormat(HiveOutputFormatImpl.class);
job.setMapperClass(MergeFileMapper.class);
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(NullWritable.class);
job.setNumReduceTasks(0);
// set these properties so that Hadoop picks up the split sizes
job.setLong("mapred.max.split.size", work.getMaxSplitSize());
job.setLong("mapred.min.split.size.per.rack", work.getMinSplitSizePerRack());
job.setLong("mapred.min.split.size.per.node", work.getMinSplitSizePerNode());
{code}
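The same idea can be tried out without Hadoop on the classpath. In the sketch below a plain Map stands in for the job configuration, and the helper copies a split size only when it is set. All names are hypothetical, and the null check rests on the assumption that an unset value is represented as null, which should be checked against the MapWork getters before relying on it.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch: a plain Map plays the role of the Hadoop job configuration
// so the example runs on its own. Class and method names are hypothetical.
final class SplitSettingsSketch {

    static final String MAX_SPLIT = "mapred.max.split.size";
    static final String MIN_SPLIT_PER_NODE = "mapred.min.split.size.per.node";
    static final String MIN_SPLIT_PER_RACK = "mapred.min.split.size.per.rack";

    /** Copies each split size that is set into the job configuration. */
    static void applySplitSettings(Map<String, String> jobConf,
                                   Long maxSplitSize,
                                   Long minSplitSizePerNode,
                                   Long minSplitSizePerRack) {
        putIfSet(jobConf, MAX_SPLIT, maxSplitSize);
        putIfSet(jobConf, MIN_SPLIT_PER_NODE, minSplitSizePerNode);
        putIfSet(jobConf, MIN_SPLIT_PER_RACK, minSplitSizePerRack);
    }

    private static void putIfSet(Map<String, String> jobConf, String key, Long value) {
        // Assumption: an unset value is null; skipping it avoids an unboxing error.
        if (value != null) {
            jobConf.put(key, Long.toString(value));
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        // Example values only; the per-rack size is left unset on purpose.
        applySplitSettings(jobConf, 1_000_000_000L, 500_000_000L, null);
        System.out.println(jobConf);
    }
}
```

Only the keys whose values are set end up in the configuration, so a missing setting leaves the Hadoop default untouched.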
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)