Posted to hdfs-user@hadoop.apache.org by Shashidhar Rao <ra...@gmail.com> on 2014/04/14 15:58:18 UTC

How multiple input files are processed by mappers

Hi,

Can somebody please clarify my doubts? Say I have a cluster of 30 nodes
and I want to put some files into HDFS. The combined size of all the
files is 10 TB, but each file is only about 1 GB, and there are 10 files
in total.

1. In a real production environment, do we copy these 10 files into HDFS
under a folder one by one? If so, how many mappers do we specify? 10
mappers? And do we use Hadoop's put command to transfer the files?

2. If that is not the case, do we pre-process the 10 files to merge them
into a single 10 TB file and copy that into HDFS?

Regards
Shashidhar

Re: How multiple input files are processed by mappers

Posted by Alok Kumar <al...@gmail.com>.
Hi,

You can just use the put command to load the files into HDFS:
https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html#put

Copying files into HDFS does not require a mapper or a MapReduce job.
Whether you really need a single merged file depends on your processing
logic (your MapReduce code).
Also, you can set the map.input.dir directory path in the job
configuration.
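
For example, here is a minimal driver sketch using the newer mapreduce
API (the class name and HDFS paths are just placeholders). It points a
job at the folder holding the files; the framework derives the number of
map tasks from the input splits, so you never specify "10 mappers" by
hand:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FolderJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "process-input-folder");
    job.setJarByClass(FolderJobDriver.class);

    // Every file under this folder becomes part of the job input;
    // both paths here are placeholders.
    FileInputFormat.addInputPath(job, new Path("/user/shashidhar/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/shashidhar/output"));

    // Mapper/reducer classes omitted in this sketch.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Calling FileInputFormat.addInputPath is the usual way the input directory
ends up in the job configuration, rather than setting the property by
hand.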

Regards
Alok


Re: How multiple input files are processed by mappers

Posted by Nitin Pawar <ni...@gmail.com>.
1. In a real production environment, do we copy these 10 files into HDFS
under a folder one by one? If so, how many mappers do we specify? 10
mappers? And do we use Hadoop's put command to transfer the files?

Ans: This depends on what you want to do with the files. There is no rule
that says all the files need to go into one folder.
Uploading files to HDFS via DFS clients (the native Hadoop CLI or your
own Java DFS client) does not need mappers; it is a file system
operation. Remember that mappers are involved only when you call the
MapReduce framework to process or write the files. Normal file uploads
are purely DFS operations.
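
As an illustration, here is a bare-bones sketch of such a Java DFS client
upload (the paths are placeholders); it is a plain filesystem call, with
no MapReduce involved:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Same effect as running "hadoop fs -put" from the cli;
    // source and destination paths are placeholders.
    fs.copyFromLocalFile(new Path("/local/data/file1"),
                         new Path("/user/shashidhar/input/file1"));
    fs.close();
  }
}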

2. If that is not the case, do we pre-process the 10 files to merge them
into a single 10 TB file and copy that into HDFS?

Ans: You do not need to merge the files outside HDFS before putting them
there, as long as the individual files are reasonably sized. Once the
data is in HDFS, whether you merge it again depends on the purpose you
want to use it for.
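
If the concern is that many smallish files would each get their own
mappers, one option (instead of physically merging them) is a combining
input format. A rough sketch with the Hadoop 2.x mapreduce API; the
split size and input path below are just example values:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombinedInputSetup {
  public static void configureInput(Job job) throws Exception {
    // Packs many files into fewer input splits at job time,
    // so no pre-merge of the files on HDFS is needed.
    job.setInputFormatClass(CombineTextInputFormat.class);

    // Cap each combined split at roughly 128 MB (example tuning value)
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // Placeholder input folder
    CombineTextInputFormat.addInputPath(job, new Path("/user/shashidhar/input"));
  }
}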



-- 
Nitin Pawar
