Posted to common-user@hadoop.apache.org by Saptarshi Guha <sa...@gmail.com> on 2009/03/16 16:59:45 UTC

Task Side Effect files and copying(getWorkOutputPath)

Hello,
I would like to produce side effect files which will later be copied
to the output folder.
I am using FileOutputFormat, and in the Map's close() method I copy
files (from the local tmp/ folder) to
FileOutputFormat.getWorkOutputPath(job);

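void close() .... {
    if (shouldcopy) {
        // Collect every file in the local temp directory.
        ArrayList<Path> lop = new ArrayList<Path>();
        for (String ff : tempdir.list()) {
            lop.add(new Path(temppfx + ff));
        }
        // Move them all into the task's work output path.
        dstFS.moveFromLocalFile(lop.toArray(new Path[]{}), dstPath);
    }
}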

However, this throws a java.io.IOException:
`hdfs://X:54310/tmp/testseq/_temporary/_attempt_200903160945_0010_m_000000_0':
specified destination directory doest not exist

I thought this was the right place to drop side effect files. Prior
to this I was copying to the output folder, but many files were not
copied, or in fact all were, but many were deleted during the reduce
output stage. I am not sure which. (When I used NullOutputFormat, all
the files were present in the output folder.) So I resorted to
getWorkOutputPath, which threw the above exception.

So if I'm using FileOutputFormat, and my maps and/or reduces produce
side effect files on the local FS:
1) When should I copy them to the DFS (e.g. in the close() method, or
one at a time in the map/reduce method)?
2) Where should I copy them to?

I am using Hadoop 0.19 and have set jobConf.setNumTasksToExecutePerJvm(-1);
Also, each side effect file produced has a unique name, i.e. there is
no overwriting.

Thank you
Saptarshi Guha

Re: Task Side Effect files and copying(getWorkOutputPath)

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Saptarshi Guha wrote:
> Hello,
> I would like to produce side effect files which will later be copied
> to the output folder.
> I am using FileOutputFormat, and in the Map's close() method I copy
> files (from the local tmp/ folder) to
> FileOutputFormat.getWorkOutputPath(job);
>
>   
FileOutputFormat.getWorkOutputPath(job) is the correct method to get the
directory for task side effect files.

You should not use the close() method, because promotion to the output
directory happens before close() is called. You can use the configure()
method instead.
See org.apache.hadoop.tools.HadoopArchives.
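
A minimal sketch of that ordering, written in plain Java rather than against
the Hadoop API so that it runs on its own: the destination is resolved and
created once up front (the role configure() plays above), and local files are
moved into it afterwards. The class SideEffectMover and its methods are
hypothetical names, and java.nio.file is only a stand-in for Hadoop's
FileSystem. Whether the work directory has to be created by hand in a given
Hadoop version is not settled by this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical plain-Java stand-in for a task; this is not a Hadoop class.
final class SideEffectMover {
    // Stands in for the value of FileOutputFormat.getWorkOutputPath(job).
    private Path workDir;

    // Analogue of configure(): make sure the destination exists up front.
    void configure(Path workOutputPath) throws IOException {
        workDir = Files.createDirectories(workOutputPath);
    }

    // Moves every regular file in localTmp into workDir and returns the
    // new locations. File names are assumed unique, so nothing is
    // overwritten; Files.move fails if the target already exists.
    List<Path> moveAll(Path localTmp) throws IOException {
        List<Path> sources;
        try (Stream<Path> files = Files.list(localTmp)) {
            sources = files.filter(Files::isRegularFile)
                           .collect(Collectors.toList());
        }
        List<Path> moved = new ArrayList<>();
        for (Path src : sources) {
            Path dst = workDir.resolve(src.getFileName());
            Files.move(src, dst);
            moved.add(dst);
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("localtmp");
        Files.write(tmp.resolve("side-a.dat"), "a".getBytes());
        Path out = Files.createTempDirectory("out")
                        .resolve("_temporary").resolve("_attempt_demo");
        SideEffectMover mover = new SideEffectMover();
        mover.configure(out);
        System.out.println(mover.moveAll(tmp));
    }
}
```

Moving into a directory that was never created is what fails in the original
close() code; creating it before the first move avoids that failure in this
stand-in.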
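> void close() .... {
>     if (shouldcopy) {
>         // Collect every file in the local temp directory.
>         ArrayList<Path> lop = new ArrayList<Path>();
>         for (String ff : tempdir.list()) {
>             lop.add(new Path(temppfx + ff));
>         }
>         // Move them all into the task's work output path.
>         dstFS.moveFromLocalFile(lop.toArray(new Path[]{}), dstPath);
>     }
> }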
>
> However, this throws a java.io.IOException:
> `hdfs://X:54310/tmp/testseq/_temporary/_attempt_200903160945_0010_m_000000_0':
> specified destination directory doest not exist
>
> I thought this was the right place to drop side effect files. Prior
> to this I was copying to the output folder, but many files were not
> copied, or in fact all were, but many were deleted during the reduce
> output stage. I am not sure which. (When I used NullOutputFormat, all
> the files were present in the output folder.) So I resorted to
> getWorkOutputPath, which threw the above exception.
>
> So if I'm using FileOutputFormat, and my maps and/or reduces produce
> side effect files on the local FS:
> 1) When should I copy them to the DFS (e.g. in the close() method, or
> one at a time in the map/reduce method)?
> 2) Where should I copy them to?
>
> I am using Hadoop 0.19 and have set jobConf.setNumTasksToExecutePerJvm(-1);
> Also, each side effect file produced has a unique name, i.e. there is
> no overwriting.
>   
You need not set jobConf.setNumTasksToExecutePerJvm(-1); even without it,
each attempt will have a unique work output path.
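
To illustrate why the paths cannot collide, here is a small sketch that
rebuilds the work output path from its parts. The layout
<outputDir>/_temporary/_<attemptId> is read off the error message quoted in
this thread, not taken from the Hadoop source; WorkPathSketch is a
hypothetical helper, and the second attempt ID (ending in _1) is an
illustrative retry of the same map task, not one that appears in the thread.

```java
// Hypothetical helper; it only mirrors the path layout seen in the
// error message and does not call any Hadoop API.
final class WorkPathSketch {
    // Assumed layout: <outputDir>/_temporary/_<attemptId>
    static String workOutputPath(String outputDir, String attemptId) {
        return outputDir + "/_temporary/_" + attemptId;
    }

    public static void main(String[] args) {
        String out = "hdfs://X:54310/tmp/testseq";
        // Same job and same map task; only the trailing attempt number differs.
        String first = workOutputPath(out, "attempt_200903160945_0010_m_000000_0");
        String retry = workOutputPath(out, "attempt_200903160945_0010_m_000000_1");
        System.out.println(first);
        System.out.println(retry);
    }
}
```

Because the attempt ID is part of the directory name, two attempts of the
same task write to different directories regardless of JVM reuse.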

Thanks
Amareshwari