Posted to user@hive.apache.org by "Daniel,Wu" <ha...@163.com> on 2011/08/27 11:45:29 UTC

how to let one map task read multiple files?

I have a 7 GB file, which I loaded with this command:
load data local inpath '/home/oracle/store_sales.csv' into table store_sales;

The file is not compressed, so I want to compress the table to make queries run faster (I don't know how to make Hive work on a compressed file directly). So I ran:
create table test as select * from store_sales;
This created 113 Snappy-compressed files, each smaller than 10 MB (possibly because each Snappy file is the result of compressing one HDFS block). Now, whenever I run any query, it always kicks off 113 map tasks. Since the cluster has only 3 nodes, I want it to run only 3 map tasks. I set mapred.min.split.size to 350M (the compressed files total less than 1G, so 350M*3 > 1G), but it still kicks off 113 map tasks. Which parameter do I need to set to make it run 3 map tasks?
-rw-r--r--   3 oracle supergroup   10156524 2011-08-27 17:06 /user/hive/warehouse/test/000000_0.snappy
-rw-r--r--   3 oracle supergroup   10063292 2011-08-27 17:06 /user/hive/warehouse/test/000001_0.snappy
-rw-r--r--   3 oracle supergroup   10057315 2011-08-27 17:06 /user/hive/warehouse/test/000002_0.snappy
-rw-r--r--   3 oracle supergroup   10016039 2011-08-27 17:06 /user/hive/warehouse/test/000003_0.snappy
-rw-r--r--   3 oracle supergroup    9845530 2011-08-27 17:06 /user/hive/warehouse/test/000004_0.snappy
-rw-r--r--   3 oracle supergroup    9819626 2011-08-27 17:06 /user/hive/warehouse/test/000005_0.snappy
-rw-r--r--   3 oracle supergroup    9801408 2011-08-27 17:07 /user/hive/warehouse/test/000006_0.snappy
-rw-r--r--   3 oracle supergroup    9776102 2011-08-27 17:07 /user/hive/warehouse/test/000007_0.snappy
-rw-r--r--   3 oracle supergroup    9772285 2011-08-27 17:07 /user/hive/warehouse/test/000008_0.snappy
-rw-r--r--   3 oracle supergroup    9764841 2011-08-27 17:07 /user/hive/warehouse/test/000009_0.snappy
-rw-r--r--   3 oracle supergroup    9738481 2011-08-27 17:07 /user/hive/warehouse/test/000010_0.snappy
-rw-r--r--   3 oracle supergroup    9694980 2011-08-27 17:07 /user/hive/warehouse/test/000011_0.snappy
-rw-r--r--   3 oracle supergroup    9663682 2011-08-27 17:07 /user/hive/warehouse/test/000012_0.snappy
-rw-r--r--   3 oracle supergroup    9643515 2011-08-27 17:07 /user/hive/warehouse/test/000013_0.snappy
-rw-r--r--   3 oracle supergroup    9634152 2011-08-27 17:07 /user/hive/warehouse/test/000014_0.snappy
-rw-r--r--   3 oracle supergroup    9631661 2011-08-27 17:07 /user/hive/warehouse/test/000015_0.snappy
-rw-r--r--   3 oracle supergroup    9625304 2011-08-27 17:07 /user/hive/warehouse/test/000016_0.snappy
-rw-r--r--   3 oracle supergroup    9617673 2011-08-27 17:07 /user/hive/warehouse/test/000017_0.snappy
-rw-r--r--   3 oracle supergroup    9612474 2011-08-27 17:08 /user/hive/warehouse/test/000018_0.snappy
-rw-r--r--   3 oracle supergroup    9608600 2011-08-27 17:08 /user/hive/warehouse/test/000019_0.snappy
-rw-r--r--   3 oracle supergroup    9600738 2011-08-27 17:08 /user/hive/warehouse/test/000020_0.snappy
-rw-r--r--   3 oracle supergroup    9555315 2011-08-27 17:08 /user/hive/warehouse/test/000021_0.snappy
-rw-r--r--   3 oracle supergroup    9550699 2011-08-27 17:08 /user/hive/warehouse/test/000022_0.snappy
-rw-r--r--   3 oracle supergroup    9550166 2011-08-27 17:08 /user/hive/warehouse/test/000023_0.snappy
-rw-r--r--   3 oracle supergroup    9546121 2011-08-27 17:08 /user/hive/warehouse/test/000024_0.snappy
-rw-r--r--   3 oracle supergroup    9542885 2011-08-27 17:08 /user/hive/warehouse/test/000025_0.snappy


RE: how to let one map task read multiple files?

Posted by "Aggarwal, Vaibhav" <va...@amazon.com>.
CombineFileInputFormat can be used to combine multiple files into one map task.
However, CombineFileInputFormat does not attempt to combine compressed files.
Instead, it falls back to HiveFileInputFormat, which creates at least one map task per file.

7G of data is not a lot for a 3-node cluster to process, and you could consider using the SequenceFile format as the intermediate format.
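
A rough sketch of the SequenceFile suggestion, for illustration only. It assumes the Snappy codec is installed on the cluster (as the file listing in the question suggests) and that the combining input format is available in the Hive version in use. The table name test_seq and the 350 MB split size are placeholders, and whether the resulting files are actually combined into 3 map tasks depends on the Hive and Hadoop versions, so treat this as a starting point rather than a verified fix.

```sql
-- Write block-compressed SequenceFiles instead of raw .snappy text files.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- test_seq is a hypothetical table name used only for this example.
CREATE TABLE test_seq STORED AS SEQUENCEFILE
AS SELECT * FROM store_sales;

-- Ask Hive to pack several files into one split when querying.
-- 367001600 bytes = 350 MB, matching the figure used in the question.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=367001600;

SELECT COUNT(*) FROM test_seq;
```

The job output of the final query shows how many map tasks were actually launched, which is the simplest way to check whether the settings had any effect.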
