Posted to user@hive.apache.org by "Daniel,Wu" <ha...@163.com> on 2011/08/27 11:45:29 UTC
how to let one map task read multiple files?
I have a 7 GB file, which I loaded with the command
load data local inpath '/home/oracle/store_sales.csv' into table store_sales;
The file is not compressed, so I want to compress the table to make queries run faster (I don't know how to make Hive work on a compressed file directly). So I used the command
create table test as select * from store_sales;
This created 113 files compressed with Snappy, each smaller than 10 MB (possibly because each Snappy file is the result of compressing one HDFS block). Now whenever I run any query, it always kicks off 113 map tasks. Since the cluster has only 3 nodes, I want it to run only 3 map tasks. I set mapred.min.split.size to 350M (the total size of the compressed files is less than 1 GB, so 350M*3 > 1G), but it still kicks off 113 map tasks. Which parameter do I need to set to make it run 3 map tasks?
-rw-r--r-- 3 oracle supergroup 10156524 2011-08-27 17:06 /user/hive/warehouse/test/000000_0.snappy
-rw-r--r-- 3 oracle supergroup 10063292 2011-08-27 17:06 /user/hive/warehouse/test/000001_0.snappy
-rw-r--r-- 3 oracle supergroup 10057315 2011-08-27 17:06 /user/hive/warehouse/test/000002_0.snappy
-rw-r--r-- 3 oracle supergroup 10016039 2011-08-27 17:06 /user/hive/warehouse/test/000003_0.snappy
-rw-r--r-- 3 oracle supergroup 9845530 2011-08-27 17:06 /user/hive/warehouse/test/000004_0.snappy
-rw-r--r-- 3 oracle supergroup 9819626 2011-08-27 17:06 /user/hive/warehouse/test/000005_0.snappy
-rw-r--r-- 3 oracle supergroup 9801408 2011-08-27 17:07 /user/hive/warehouse/test/000006_0.snappy
-rw-r--r-- 3 oracle supergroup 9776102 2011-08-27 17:07 /user/hive/warehouse/test/000007_0.snappy
-rw-r--r-- 3 oracle supergroup 9772285 2011-08-27 17:07 /user/hive/warehouse/test/000008_0.snappy
-rw-r--r-- 3 oracle supergroup 9764841 2011-08-27 17:07 /user/hive/warehouse/test/000009_0.snappy
-rw-r--r-- 3 oracle supergroup 9738481 2011-08-27 17:07 /user/hive/warehouse/test/000010_0.snappy
-rw-r--r-- 3 oracle supergroup 9694980 2011-08-27 17:07 /user/hive/warehouse/test/000011_0.snappy
-rw-r--r-- 3 oracle supergroup 9663682 2011-08-27 17:07 /user/hive/warehouse/test/000012_0.snappy
-rw-r--r-- 3 oracle supergroup 9643515 2011-08-27 17:07 /user/hive/warehouse/test/000013_0.snappy
-rw-r--r-- 3 oracle supergroup 9634152 2011-08-27 17:07 /user/hive/warehouse/test/000014_0.snappy
-rw-r--r-- 3 oracle supergroup 9631661 2011-08-27 17:07 /user/hive/warehouse/test/000015_0.snappy
-rw-r--r-- 3 oracle supergroup 9625304 2011-08-27 17:07 /user/hive/warehouse/test/000016_0.snappy
-rw-r--r-- 3 oracle supergroup 9617673 2011-08-27 17:07 /user/hive/warehouse/test/000017_0.snappy
-rw-r--r-- 3 oracle supergroup 9612474 2011-08-27 17:08 /user/hive/warehouse/test/000018_0.snappy
-rw-r--r-- 3 oracle supergroup 9608600 2011-08-27 17:08 /user/hive/warehouse/test/000019_0.snappy
-rw-r--r-- 3 oracle supergroup 9600738 2011-08-27 17:08 /user/hive/warehouse/test/000020_0.snappy
-rw-r--r-- 3 oracle supergroup 9555315 2011-08-27 17:08 /user/hive/warehouse/test/000021_0.snappy
-rw-r--r-- 3 oracle supergroup 9550699 2011-08-27 17:08 /user/hive/warehouse/test/000022_0.snappy
-rw-r--r-- 3 oracle supergroup 9550166 2011-08-27 17:08 /user/hive/warehouse/test/000023_0.snappy
-rw-r--r-- 3 oracle supergroup 9546121 2011-08-27 17:08 /user/hive/warehouse/test/000024_0.snappy
-rw-r--r-- 3 oracle supergroup 9542885 2011-08-27 17:08 /user/hive/warehouse/test/000025_0.snappy
RE: how to let one map task read multiple files?
Posted by "Aggarwal, Vaibhav" <va...@amazon.com>.
CombineFileInputFormat can be used to combine multiple files into one map task.
However, CombineFileInputFormat does not attempt to combine compressed files.
Instead it falls back to HiveInputFormat, which creates at least one map task per file.
7 GB of data is not much for a 3-node cluster to process, so you could consider using the sequence file format as the intermediate format.
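The sequence-file suggestion above could look roughly like the following Hive session. This is a minimal sketch, not a tested recipe: the table name test_seq is hypothetical, the Snappy codec class name is an assumption about what is installed on the cluster, and the property names are the Hadoop 0.20-era ones, so they may differ in other versions.

```sql
-- Sketch only: check every property name against your Hadoop and Hive versions.

-- Write compressed output, using block-level compression inside sequence files.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
-- Assumed codec class; use whichever Snappy codec class your cluster provides.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- test_seq is a hypothetical table name.
CREATE TABLE test_seq STORED AS SEQUENCEFILE AS
SELECT * FROM store_sales;

-- Ask Hive to group several files into one split when querying.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Upper bound on a combined split, in bytes (350 MB = 350 * 1024 * 1024).
SET mapred.max.split.size=367001600;

SELECT COUNT(*) FROM test_seq;
```

Block-compressed sequence files stay splittable, so the combining input format should be able to pack several of them into one map task. The resulting number of map tasks depends on file sizes and block placement, so it may not come out at exactly 3.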