Posted to common-user@hadoop.apache.org by Matthew John <tm...@gmail.com> on 2010/11/15 06:37:20 UTC

Tweaking the File write in HDFS

Hi all,

I have been working with MapReduce and HDFS for some time. The procedure I
normally follow is:

1) copy the input file from the local file system into HDFS

2) run the MapReduce job

3) copy the output file from HDFS back to the local file system

But I feel steps 1 and 3 add a lot of overhead to the entire process!

My queries are:

1) I am receiving the files onto the local file system over a socket
connection from another node. Can I ensure that the data arriving at the
Hadoop node is written directly to HDFS, instead of going through the local
file system and then doing a copyFromLocal? (The current flow is sketched
just after these queries.)

2) Can I write the reduce output (the final output file) directly to the
local file system instead of into HDFS (and hence spread across different
nodes in HDFS), so that I minimize the overhead? I expect this to take much
less time than writing to HDFS and then doing a copyToLocal. Finally, I
should be able to send this file back to another node over a socket.
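
To make query 1 concrete, this is roughly what the receiving side on the
Hadoop node does today, before the copyFromLocal step (the port number and
paths below are just placeholders):

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Current flow: receive the file over a socket, spool it to local disk,
// then run "hadoop fs -copyFromLocal /tmp/input.dat /user/matthew/input.dat".
public class ReceiveToLocal {
  public static void main(String[] args) throws Exception {
    ServerSocket server = new ServerSocket(12345);
    Socket socket = server.accept();
    InputStream in = socket.getInputStream();
    OutputStream out = new FileOutputStream("/tmp/input.dat");
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    out.close();
    socket.close();
    server.close();
  }
}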

Looking forward to your suggestions!

Thanks,

Matthew John

Re: Tweaking the File write in HDFS

Posted by Zooni79 <zo...@gmail.com>.
Hi,

As an extension to the problem statement... is it possible to fuse steps 1 and 2 into one step?
i.e., can we have the map tasks pick up their input from an external filesystem instead of HDFS?
Can FTPFileSystem/RawLocalFileSystem be of any help here?
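
Roughly what I have in mind, using the old JobConf API (the paths and the
ftp URI below are made up, and I haven't tried either):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ExternalInputJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ExternalInputJob.class);
    conf.setJobName("external-input");

    // Point the map input at a non-HDFS filesystem by giving a fully
    // qualified URI. file:// should work if the path is visible on every
    // node; an FTP source would be something like ftp://user:pass@host/dir.
    FileInputFormat.setInputPaths(conf, new Path("file:///shared/input"));

    // Write the job output into HDFS as usual.
    FileOutputFormat.setOutputPath(conf, new Path("/user/zahoor/output"));

    JobClient.runJob(conf);
  }
}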

./zahoor

On 15-Nov-2010, at 3:10 PM, Sebastian Schoenherr wrote:

> Hi Matthew,
> of course, you can copy it directly to HDFS and vice versa. Use IOUtils (org.apache.hadoop.io.IOUtils) like this:
> 
> // conf is your org.apache.hadoop.conf.Configuration
> FileSystem fileSystem = FileSystem.get(conf); // org.apache.hadoop.fs.FileSystem
> 
> // "in" is your source stream; "out" is the HDFS output stream you get
> // from fileSystem.create(new Path(...))
> IOUtils.copyBytes(in, out, fileSystem.getConf());
> 
> hope this helps,
> sebastian
> 
> Zitat von Matthew John <tm...@gmail.com>:
> 
>> [...]


Re: Tweaking the File write in HDFS

Posted by Sebastian Schoenherr <Se...@student.uibk.ac.at>.
Hi Matthew,
of course, you can copy it directly to HDFS and vice versa. Use
IOUtils (org.apache.hadoop.io.IOUtils) like this:

// conf is your org.apache.hadoop.conf.Configuration
FileSystem fileSystem = FileSystem.get(conf); // org.apache.hadoop.fs.FileSystem

// "in" is your source stream; "out" is the HDFS output stream you get
// from fileSystem.create(new Path(...))
IOUtils.copyBytes(in, out, fileSystem.getConf());
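
For example, a rough sketch of both directions (pushing data arriving on a
socket straight into HDFS, and streaming the job output back out of HDFS to
a socket instead of doing a copyToLocal) could look like the following. The
class name, host, port and paths are only placeholders, and error handling
is left out:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsSocketCopy {

  // Receive bytes from a socket and write them straight into an HDFS file,
  // without spooling to the local file system first.
  public static void socketToHdfs(int port, String hdfsFile) throws Exception {
    Configuration conf = new Configuration(); // expects core-site.xml on the classpath
    FileSystem fs = FileSystem.get(conf);
    ServerSocket server = new ServerSocket(port);
    Socket socket = server.accept();
    InputStream in = socket.getInputStream();
    OutputStream out = fs.create(new Path(hdfsFile)); // HDFS output stream
    IOUtils.copyBytes(in, out, conf, true);           // true = close both streams
    socket.close();
    server.close();
  }

  // The "vice versa" direction: stream a finished output file out of HDFS
  // and push it to another node over a socket, skipping copyToLocal.
  public static void hdfsToSocket(String hdfsFile, String host, int port) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Socket socket = new Socket(host, port);
    InputStream in = fs.open(new Path(hdfsFile));     // HDFS input stream
    OutputStream out = socket.getOutputStream();
    IOUtils.copyBytes(in, out, conf, true);
    socket.close();
  }
}

For the reduce output you would open the part file(s) under the job's
output directory, e.g. /user/matthew/output/part-00000.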

hope this helps,
sebastian

Zitat von Matthew John <tm...@gmail.com>:

> [...]