Posted to user@hive.apache.org by Raj Hadoop <ha...@yahoo.com> on 2014/02/26 02:42:20 UTC

part-m-00000 files and their size - Hive table

Hi,

I am loading data into HDFS files through sqoop and creating a Hive table that points to these files.

The mapper output files generated by sqoop look like the following:

part-m-00000
part-m-00001
part-m-00002

My question is:
1) For Hive query performance, how important is the distribution of the file sizes above?

part_m_0 say 1 GB
part_m_1 say 3 GB
part_m_2 say 0.25 GB

vs.

part_m_0 say 1.4 GB
part_m_1 say 1.4 GB
part_m_2 say 1.45 GB


NOTE: The sizes and the number of files are just a sample. The real numbers are far bigger.


I am assuming the uniform distribution has a performance benefit.

If so, what is the reason, and could you share the technical details?

Re: part-m-00000 files and their size - Hive table

Posted by Raj Hadoop <ha...@yahoo.com>.
Thanks for the detailed explanation, Yong. It helps.

Regards,
Raj

RE: part-m-00000 files and their size - Hive table

Posted by java8964 <ja...@hotmail.com>.
Yes, it is good if the file sizes are close to even, but it is not very important unless some files are very small compared to the block size.
The reasons are:
Your files should be splittable to be used in Hadoop (and for Hive it is the same thing). If they are splittable, then a 1 GB file will use 8 blocks (assuming a 128 MB block size) and a 256 MB file will take 2 blocks, so these two files generate 10 mapper tasks that are spread evenly across your cluster. From a performance point of view, one 1 GB file plus one 256 MB file is not a big deal. But if one file is very small, say 10 MB, that file still consumes a whole mapper task, which is bad for performance: starting and stopping a task uses a fair amount of resources while processing only 10 MB of data.
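To make the arithmetic concrete, here is a minimal sketch (my own numbers, assuming a splittable format, a 128 MB block size, and roughly one map task per HDFS block; the exact count depends on the input format and split-size settings):

# Rough sketch: estimate how many map tasks a set of splittable files
# produces when the split size equals the assumed 128 MB block size.
import math

BLOCK_MB = 128  # assumed HDFS block size

def mappers_for(file_sizes_mb):
    """Estimated number of map tasks for the given file sizes (in MB)."""
    return sum(max(1, math.ceil(size / BLOCK_MB)) for size in file_sizes_mb)

print(mappers_for([1024, 256]))      # 8 + 2 = 10 tasks, all doing real work
print(mappers_for([1024, 256, 10]))  # 11 tasks; one task reads only 10 MB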
The reason you see uneven file sizes in the sqoop output is that it is hard for sqoop to split your source data evenly. For example, if you dump table A from a database into Hive, sqoop will do the following (a sketch of this splitting appears after the list):

1) Identify the primary/unique key of the table.
2) Find the min/max values of that key; let's say they are 1 to 1,000,000.
3) Split that range based on the number of mapper tasks. If you run sqoop with 4 mappers, the data will be split into 4 groups: (1, 250,000), (250,001, 500,000), (500,001, 750,000), (750,001, 1,000,000). As you can imagine, your data is most likely not evenly distributed by that key across the 4 groups, so you get uneven part-m-xxx output files.
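As a rough sketch of the idea (simplified boundary math, not sqoop's actual implementation):

# Illustrative only: mimics the way a sqoop-style import carves an integer
# split column into one contiguous range per mapper.
def split_ranges(lo, hi, num_mappers):
    """Split the closed interval [lo, hi] into num_mappers contiguous ranges."""
    step = (hi - lo + 1) / num_mappers
    ranges = []
    for i in range(num_mappers):
        start = lo + round(i * step)
        end = hi if i == num_mappers - 1 else lo + round((i + 1) * step) - 1
        ranges.append((start, end))
    return ranges

print(split_ranges(1, 1_000_000, 4))
# [(1, 250000), (250001, 500000), (500001, 750000), (750001, 1000000)]
# Each mapper then pulls its range with a WHERE clause and writes one
# part-m-xxxxx file; if most rows fall into one range, that file dominates.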
Keep in mind that you are not required to use the primary or unique key as the split column. You can choose whatever column in your table makes sense; pick whichever one makes the split more even.
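To see why the choice of split column matters, here is a small hypothetical example: when the values of the split column are heavily skewed, equal-width ranges hold very different numbers of rows, so one mapper writes a far bigger part-m file than the others. (In sqoop, --split-by is the option used to pick that column.)

import random

random.seed(0)
# Hypothetical skewed key distribution: ~90% of the rows have ids
# below 100,000, the rest spread over 100,001..1,000,000.
keys = [random.randint(1, 100_000) if random.random() < 0.9
        else random.randint(100_001, 1_000_000)
        for _ in range(100_000)]

# The four equal-width ranges from the example above.
ranges = [(1, 250_000), (250_001, 500_000),
          (500_001, 750_000), (750_001, 1_000_000)]

for lo, hi in ranges:
    rows = sum(lo <= k <= hi for k in keys)
    print(f"mapper for [{lo:>7}, {hi:>9}]: {rows} rows")
# The first mapper gets the vast majority of the rows, so its part-m file is
# far larger; splitting on a more uniformly distributed column evens this out.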
Yong