Posted to hdfs-user@hadoop.apache.org by "Natarajan, Prabakaran 1. (NSN - IN/Bangalore)" <pr...@nsn.com> on 2014/07/17 09:52:31 UTC

Multiple Part files

Hi

After our MapReduce job completes, we see multiple small part files in the output directory. We are using the RC file format (Snappy codec).

1)      Will each part file take up a full 64 MB block?
2)      How can we merge these multiple RC format part files into one RC file?
3)      What are the pros and cons of having multiple part files?
4)      Will merging the part files improve performance?

Thanks and Regards
Prabakaran.N  aka NP
nsn, Bangalore
When "I" is replaced by "We" - even Illness becomes "Wellness"





Re: Multiple Part files

Posted by Peyman Mohajerian <mo...@gmail.com>.
Hadoop has a getmerge command
(http://hadoop.apache.org/docs/r0.19.1/hdfs_shell.html#getmerge).
I'm not certain whether it works with RC files, but I think it should. So maybe you
don't have to copy the files to local.
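
For reference, getmerge concatenates the files under an HDFS source path into a single file on the local filesystem, so the data still ends up local. The Python sketch below uses a made-up toy container format (the `TOY1` header and both helper functions are hypothetical and are not the RC file layout) to illustrate why byte-level concatenation of files that each carry their own header can differ from a format-aware merge. Whether a concatenated set of RC files is readable would need to be checked against the actual reader.

```python
MAGIC = b"TOY1"  # hypothetical 4-byte file header, NOT the real RC file header


def write_toy(records):
    """Serialize records as: header, then (1-byte length + payload) per record."""
    body = b"".join(bytes([len(r)]) + r for r in records)
    return MAGIC + body


def read_toy(data):
    """Parse a toy file; raises ValueError if the leading header is missing."""
    if data[:4] != MAGIC:
        raise ValueError("bad header")
    records, pos = [], 4
    while pos < len(data):
        length = data[pos]
        records.append(data[pos + 1 : pos + 1 + length])
        pos += 1 + length
    return records


part0 = write_toy([b"a", b"bb"])
part1 = write_toy([b"ccc"])

# Byte-level concatenation: the second file's header lands mid-stream.
concatenated = part0 + part1
# Format-aware merge: read every part, then write one file with one header.
merged = write_toy(read_toy(part0) + read_toy(part1))

print(read_toy(merged))        # [b'a', b'bb', b'ccc']
print(read_toy(concatenated))  # the second header is misread as record data
```

In this toy format the format-aware merge round-trips all three records, while the concatenated bytes parse into something else. That is the reason a purpose-built merge tool (or a single reducer) may be the safer route for a structured format.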


On Thu, Jul 17, 2014 at 6:18 AM, Naganarasimha G R (Naga) <
garlanaganarasimha@huawei.com> wrote:

> Hi Prabakaran,
>
> You see multiple small part files in the output directory because each
> reducer task writes its output as one part file.
>
>    1. Will each part file take up a full 64 MB block?
>
>       Each reducer creates one part file, whose size depends on that
>       reducer's output. A file can be smaller than the HDFS block size,
>       i.e. it does not have to be 64 MB.
>
>    2. How to merge these multiple RC format part files into one RC file?
>
>       One way (maybe the longer way) is to copy the part files to a
>       local directory and write a tool to merge all the RC files.
>
>       But I feel that in the first place we should ensure the job has a
>       single reducer, so that there is no need for merging.
>
>    3. What are the pros and cons of having multiple part files?
>
>       It depends on the operation you want to do next. For example, if
>       you plan to load the data into Hive, it is better to configure the
>       MR job to partition its output to match the Hive partitions, which
>       makes loading easier.
>
>    4. Will merging part files improve performance?
>
>       Performance of the MapReduce job or of a later operation? I think
>       if the overall scenario is known, we can help better.
>
> Regards,
>
> Naga


RE: Multiple Part files

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi Prabakaran,

You see multiple small part files in the output directory because each reducer task writes its output as one part file.

  1.  Will each part file take up a full 64 MB block?

      Each reducer creates one part file, whose size depends on that reducer's output. A file can be smaller than the HDFS block size, i.e. it does not have to be 64 MB.

  2.  How to merge these multiple RC format part files into one RC file?

      One way (maybe the longer way) is to copy the part files to a local directory and write a tool to merge all the RC files.

      But I feel that in the first place we should ensure the job has a single reducer, so that there is no need for merging.

  3.  What are the pros and cons of having multiple part files?

      It depends on the operation you want to do next. For example, if you plan to load the data into Hive, it is better to configure the MR job to partition its output to match the Hive partitions, which makes loading easier.

  4.  Will merging part files improve performance?

      Performance of the MapReduce job or of a later operation? I think if the overall scenario is known, we can help better.
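
To make the answer to question 1 concrete, here is a minimal Python sketch of the arithmetic. It assumes the 64 MB block size from the question; the part-file sizes and the function name `hdfs_footprint` are hypothetical, chosen only for illustration, and replication is ignored. It shows that a file stores only as many bytes as it contains, while every non-empty file still needs at least one block.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_sizes, block_size=64 * MB):
    """Return (block_count, bytes_stored) for files of the given sizes.

    Each file is split into ceil(size / block_size) blocks, and the last
    block holds only the remaining bytes, so nothing is padded to 64 MB.
    """
    blocks = sum(math.ceil(size / block_size) for size in file_sizes)
    stored = sum(file_sizes)
    return blocks, stored


# Hypothetical example: 20 part files of 5 MB each versus one 100 MB file.
parts = [5 * MB] * 20
merged = [sum(parts)]

part_blocks, part_bytes = hdfs_footprint(parts)
merged_blocks, merged_bytes = hdfs_footprint(merged)

print(part_blocks, part_bytes // MB)      # 20 100
print(merged_blocks, merged_bytes // MB)  # 2 100
```

The bytes stored are the same in both cases. What differs is the number of files and blocks the NameNode has to track, and typically the number of input splits a later job sees, which is where questions 3 and 4 come in.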



Regards,

Naga







