Posted to common-user@hadoop.apache.org by Devaraj Das <dd...@yahoo-inc.com> on 2007/07/01 19:45:16 UTC

RE: Sort inputs, outputs

> 1. Do I need to setup files specially for them to work with sort?  My
> self-made test files always causes the map tasks to fail.  They're just
> text files with lines such as "123456 abcdef", "789012 ghijkl", etc.

What sort of failures do you get? I am assuming that you want to sort on the
numeric string at the start of every line. I don't know which InputFormat
you are using, but for your example the closest built-in one is
TextInputFormat. With it, the key passed to your map is the offset of the
line in the file (a LongWritable object) and the value is the line itself
(a Text object). You would then need to split the line into its two parts,
create an IntWritable or a LongWritable object from the first part and a
Text object from the second part, and call output.collect with the first
part as the key and the second part as the value. Have a look at
org.apache.hadoop.examples.WordCount.MapClass.map to get a better feel for
what I am saying.
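
The splitting step described above can be sketched in plain Java, without
any Hadoop classes, so it can be tried on its own. LineParser and parse are
hypothetical names used only for illustration. The sketch assumes each line
is a number, some whitespace, and then the text. Inside a real map method
you would wrap the two results in a LongWritable and a Text before calling
output.collect.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): splits a line such as
// "123456 abcdef" into a numeric key and a text value.
final class LineParser {

    static Map.Entry<Long, String> parse(String line) {
        String trimmed = line.trim();

        // Find the first whitespace character separating the two parts.
        int split = -1;
        for (int i = 0; i < trimmed.length(); i++) {
            if (Character.isWhitespace(trimmed.charAt(i))) {
                split = i;
                break;
            }
        }
        if (split < 0) {
            throw new IllegalArgumentException(
                "expected \"<number> <text>\" but got: " + line);
        }

        // Long.parseLong throws NumberFormatException if the first part
        // is not a valid number.
        long key = Long.parseLong(trimmed.substring(0, split));
        String value = trimmed.substring(split + 1).trim();
        return new SimpleImmutableEntry<>(key, value);
    }
}
```

Rejecting malformed lines with an exception is one choice; a map task could
instead skip and count them so that one bad line does not fail the task.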

> 2. How do I check to make sure the sort output is truly sorted, when using
> the randomwriter + sort test?  Is there any specific way to view the
> output files?

You could use the SortValidator. If you ran sort on the input directory,
inputDir, and it created the output directory, outputDir, run the
SortValidator with inputDir as <input-path> and outputDir as <sort-output>:

  bin/hadoop jar build/hadoop-<version#>-dev-test.jar testmapredsort -sortInput <input-path> -sortOutput <sort-output>

> 3. Are the outputs of the test programs typically part-00000, part-00001,
> ...part-XXXXX?  Is there any suggested method for merging them?

Yes. You could run another mapreduce job with exactly one reduce task to
merge them; with a single reduce, the job writes a single output file.

-----Original Message-----
From: Kevin Lim [mailto:ktlim@umich.edu] 
Sent: Friday, June 29, 2007 2:56 AM
To: hadoop-user@lucene.apache.org
Subject: Sort inputs, outputs

Hi,

I have set up Hadoop on 2 machines and am now trying to see if it is working
properly.  I have 3 questions:
1. Do I need to set up files specially for them to work with sort?  My
self-made test files always cause the map tasks to fail.  They're just text
files with lines such as "123456 abcdef", "789012 ghijkl", etc.

2. How do I check to make sure the sort output is truly sorted, when using
the randomwriter + sort test?  Is there any specific way to view the output
files?

3. Are the outputs of the test programs typically part-00000, part-00001,
...part-XXXXX?  Is there any suggested method for merging them?

Thanks,

Kevin Lim


Re: Sort inputs, outputs

Posted by "Peter W." <pe...@marketingbrokers.com>.
Hi,

You could also merge the part files the old-fashioned way:

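       // Run the job that writes the part files into OUT_DIR.
       try
          {
          JobConf jc=new JobConf(sample.class);
          ...
          jc.setInputPath(new Path(IN_DIR));
          jc.setOutputPath(new Path(OUT_DIR));
          JobClient.runJob(jc);
          }
       catch(Exception e){System.out.println(e);}

       // List the output directory (needs java.io.File, java.util.Arrays).
       String[] d=new File(OUT_DIR).list();
       if(d==null) d=new String[0];   // OUT_DIR missing or not a directory
       Arrays.sort(d);                // part-00000, part-00001, ... in order

       for(int dint=0;dint<d.length;dint++)
          {
          if(!d[dint].startsWith("."))   // no dot files
             {
             String PART_FILE=OUT_DIR+"/"+d[dint];   // part path

             // create file object from part path string,
             // do file merge, append, cleanup

             }
          }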
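
The "do file merge" step above could be filled in along these lines. This
is only a sketch under some assumptions: PartMerger and merge are
hypothetical names, the part files are plain text on the local filesystem
(java.io.File and java.nio do not see DFS), and the merged file is written
outside the output directory. It concatenates the part files in name order.
Note that the result is globally sorted only if the job partitioned keys in
order; otherwise each part is sorted only within itself. Binary outputs such
as SequenceFiles cannot be merged by plain concatenation.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: concatenates part-* files from a local directory.
final class PartMerger {

    static void merge(Path outDir, Path merged) throws IOException {
        // Collect the part files; the glob skips dot files such as
        // checksum files because they do not start with "part-".
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(outDir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }

        // Name order gives part-00000, part-00001, ...
        Collections.sort(parts);

        // Append each part file to the merged output, byte for byte.
        try (OutputStream out = Files.newOutputStream(merged)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }
}
```

A call might look like PartMerger.merge(Paths.get(OUT_DIR),
Paths.get("merged.txt")), where merged.txt is just an example name.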

Later,

Peter W.


On Jul 1, 2007, at 10:45 AM, Devaraj Das wrote:

>> 3. Are the outputs of the test programs typically part-00000, part-00001,
>> ...part-XXXXX?
>> Is there any suggested method for merging them?
>
> Yes. You could run another mapreduce job with exactly one reduce to merge
> them.