Posted to user@hadoop.apache.org by java8964 java8964 <ja...@hotmail.com> on 2013/10/02 22:22:01 UTC

Will different files in HDFS trigger different mapper

Hi, I have a question about how mappers are generated for input files from HDFS. I understand the split and block concepts in HDFS, but my original understanding was that one mapper will only process data from one file in HDFS, no matter how small that file is. Is that correct?
The reason I ask is that in some ETL jobs I have seen logic that identifies the data set based on a file name convention. So in the mapper, before processing the first key/value pair, we can add logic in the map() method to get the file name of the current input and initialize some state from it. After that, we don't need to worry that data could come from another file later, since one map task will only handle data from one file, even when the file is very small. So small files not only cause trouble for NameNode memory, they also waste map tasks, as each map task may consume too little data.
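As an aside, the usual way to get at the current input file inside a mapper is through the job context: the Java MapReduce API exposes it via the input split, and in Hadoop Streaming it arrives as the `map_input_file` (later `mapreduce_map_input_file`) environment variable. A minimal Streaming-style sketch in Python, assuming that environment-variable convention (the `run_mapper` structure is illustrative, not from the original post):

```python
import os
import sys

def input_file_name(environ):
    """Return the current input file path as exposed by Hadoop Streaming.

    Older Hadoop versions set 'map_input_file'; newer ones set
    'mapreduce_map_input_file'. Returns None outside a streaming task.
    """
    return environ.get("map_input_file") or environ.get("mapreduce_map_input_file")

def run_mapper(stdin=sys.stdin, environ=os.environ):
    # Initialize per-file logic once, before the first record.
    # This is only safe if the whole task reads a single file,
    # i.e. splits are NOT combined across files.
    fname = input_file_name(environ)
    for line in stdin:
        # ... per-record logic that may branch on fname ...
        pass
```

Note the caveat baked into the comment: once an input format combines several small files into one split (as CombineHiveInputFormat does, see the reply below in the thread), a single task can see records from multiple files, and this one-time per-file initialization breaks.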
But today, when I ran the following Hive query (Hadoop 1.0.4 and Hive 0.9.1):
select partition_column, count(*) from test_table group by partition_column
it generated only 2 mappers in the MR job. This is an external Hive table, and the input for this MR job is only 338 MB, but there are more than 100 data files in HDFS for this table, even though many of them are very small. This is a one-node cluster, but it is configured in full cluster mode, not local mode. Shouldn't the MR job trigger at least 100 mappers? Or does Hive work differently, so that my original assumption no longer holds?
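For what it's worth, 2 mappers is roughly what combined splits would predict. If Hive packs small files into splits of at most about 256 MB (a common value for mapred.max.split.size; the actual limit depends on the cluster config, so treat this as an assumption), then 338 MB of input needs:

```python
import math

input_mb = 338       # total input reported for the MR job
max_split_mb = 256   # assumed mapred.max.split.size (cluster-dependent)

# Each combined split holds at most max_split_mb of data,
# so the mapper count is the ceiling of the ratio.
mappers = math.ceil(input_mb / max_split_mb)
print(mappers)  # -> 2
```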
Thanks
Yong

RE: Will different files in HDFS trigger different mapper

Posted by Sourygna Luangsay <sl...@pragsis.com>.
Hi,

 

If you have a lot of small files, by default Hive will group several of them
into a single mapper.

Check this property:

hive.input.format (defaults to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
if empty) => if you set it to
"org.apache.hadoop.hive.ql.io.HiveInputFormat", you'll get 100 maps (and a
much slower MapReduce job).

 

Other properties let you tune this behavior. For instance:

mapred.max.split.size

hive.merge.mapfiles=true;
hive.merge.mapredfiles=true;
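Put together, the settings above might look like this in a Hive session (the split size value is only an example, not a recommendation):

```sql
-- Force one mapper per file (slower, but matches the one-file-per-mapper model):
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Or keep CombineHiveInputFormat and cap how much data one combined split may hold:
SET mapred.max.split.size=256000000;

-- Merge small output files produced by map-only and map-reduce jobs:
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
```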

 

Regards,

 

Sourygna 

 


