Posted to common-dev@hadoop.apache.org by Matthew John <tm...@gmail.com> on 2010/09/07 12:01:56 UTC

Sort with custom input/output format !!

Hey,
I'm pretty new to Hadoop.

I need to sort a metafile (TBs in size) and thought of using the Hadoop Sort
(in examples) for it.
My input metafile is a raw binary stream. It consists of fixed-size records
of 40 bytes each.
Every record looks like this:

long a;                    <key>   --> 8 bytes
long b;                    <value> --> the rest of the structure, 32 bytes
int c;
int d;
int e;
int unprocessed;
int compress_attempted;
int gatherer;
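
For reference, the layout above can be sketched in plain Java as follows. This
is only an illustration, not one of the Writable classes from this post: the
class name MetaRecord is hypothetical, and it assumes the records are written
little-endian with no struct padding (two 8-byte longs followed by six 4-byte
ints, 40 bytes in total).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: packs and unpacks one 40-byte metafile record.
final class MetaRecord {
    static final int KEY_BYTES = 8;     // long a
    static final int VALUE_BYTES = 32;  // long b + six ints
    static final int RECORD_BYTES = KEY_BYTES + VALUE_BYTES;

    // Assumption: the file was produced on a little-endian machine.
    static final ByteOrder ORDER = ByteOrder.LITTLE_ENDIAN;

    long a;  // key
    long b;  // first field of the value
    int c, d, e, unprocessed, compressAttempted, gatherer;

    // Serialize the fields in declaration order into exactly 40 bytes.
    byte[] toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(RECORD_BYTES).order(ORDER);
        buf.putLong(a).putLong(b)
           .putInt(c).putInt(d).putInt(e)
           .putInt(unprocessed).putInt(compressAttempted).putInt(gatherer);
        return buf.array();
    }

    // Parse exactly 40 bytes back into a record.
    static MetaRecord fromBytes(byte[] raw) {
        if (raw.length != RECORD_BYTES) {
            throw new IllegalArgumentException("expected " + RECORD_BYTES + " bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ORDER);
        MetaRecord r = new MetaRecord();
        r.a = buf.getLong();
        r.b = buf.getLong();
        r.c = buf.getInt();
        r.d = buf.getInt();
        r.e = buf.getInt();
        r.unprocessed = buf.getInt();
        r.compressAttempted = buf.getInt();
        r.gatherer = buf.getInt();
        return r;
    }
}
```

If the generating machine is big-endian, only the ORDER constant would need to
change.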


I have created *FpMetaId.java (extends BytesWritable)* corresponding to
the <key> and *FpMetadata.java (extends BytesWritable)* corresponding to
the <value>.

My sole aim is to get these 40-byte records sorted with the fp (double)
as the key, and to write the sorted records back into a metafile
(exactly my old metafile, binary only, but with the records sorted).
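
To make the goal concrete, here is a small in-memory sketch of the intended
result in plain Java. It is not the Hadoop job itself and would not scale to
TBs; it only shows what "sorted by key, same binary layout" means. The class
name InMemoryRecordSort is hypothetical, and the sketch assumes the key is the
leading 8 bytes of each record, read as a signed little-endian long.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration: sort concatenated 40-byte records by their key.
final class InMemoryRecordSort {
    static final int RECORD_BYTES = 40;
    static final int KEY_BYTES = 8;

    // Assumption: the key is a signed little-endian long at offset 0.
    static long keyOf(byte[] record) {
        return ByteBuffer.wrap(record, 0, KEY_BYTES)
                         .order(ByteOrder.LITTLE_ENDIAN)
                         .getLong();
    }

    // Returns a new array holding the same records, ordered by ascending key.
    static byte[] sortRecords(byte[] metafile) {
        if (metafile.length % RECORD_BYTES != 0) {
            throw new IllegalArgumentException("length is not a multiple of 40");
        }
        int n = metafile.length / RECORD_BYTES;
        byte[][] records = new byte[n][];
        for (int i = 0; i < n; i++) {
            records[i] = Arrays.copyOfRange(
                metafile, i * RECORD_BYTES, (i + 1) * RECORD_BYTES);
        }
        Comparator<byte[]> byKey = Comparator.comparingLong(InMemoryRecordSort::keyOf);
        Arrays.sort(records, byKey);

        // Write the records back out in the original raw layout.
        byte[] out = new byte[metafile.length];
        for (int i = 0; i < n; i++) {
            System.arraycopy(records[i], 0, out, i * RECORD_BYTES, RECORD_BYTES);
        }
        return out;
    }
}
```
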
I also implemented:

*MetafileInputFormat.java (extends SequenceFileAsBinaryInputFormat)* --->
makes the input file format compatible with my record.
*MetafileOutputFormat<K, V> extends SequenceFileOutputFormat* --->
makes the output file format compatible with my record.
*MetafileRecordReader.java (extends
SequenceFileAsBinaryInputFormat.SequenceFileAsBinaryRecordReader)* --->
implements the record reader compatible with my record.

The MetafileRecordWriter class is implemented within my
MetafileOutputFormat.java file.
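
What the record reader and writer are meant to do with each record can be
sketched without any Hadoop classes. The following is plain java.io code, not
the Hadoop RecordReader/RecordWriter API and not the classes listed above; the
name FixedLengthRecords is hypothetical. It assumes every record is exactly
8 key bytes followed by 32 value bytes, with nothing in between.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical illustration of reading and writing raw fixed-length records.
final class FixedLengthRecords {
    static final int KEY_BYTES = 8;
    static final int VALUE_BYTES = 32;

    // Fills key and value with the next record.
    // Returns false at a clean end of stream; a record cut off in the
    // middle makes readFully throw EOFException.
    static boolean readNext(DataInputStream in, byte[] key, byte[] value)
            throws IOException {
        int first = in.read();
        if (first < 0) {
            return false;  // no more records
        }
        key[0] = (byte) first;
        in.readFully(key, 1, KEY_BYTES - 1);
        in.readFully(value, 0, VALUE_BYTES);
        return true;
    }

    // Writes one record as key bytes followed by value bytes, with no framing.
    static void write(DataOutputStream out, byte[] key, byte[] value)
            throws IOException {
        out.write(key, 0, KEY_BYTES);
        out.write(value, 0, VALUE_BYTES);
    }
}
```
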

Let me walk you through the sequence of events that followed:

1) I resolved all the errors in the Writable classes (FpMetaId, FpMetadata),
in the in/out formats (MetafileInputFormat, MetafileOutputFormat) and in the
RecordReaders I implemented.

2) I copied the Writables to the /io folder and the other new files to the
/mapred folder. The build succeeded.

3) I modified the Sort file (the job I want to run with FpMetaId as key
and FpMetadata as value) and imported these new classes into it. I
changed the default conf settings to the required Writables and
RecordReaders, then rebuilt Hadoop using the ant command. The build
succeeded.

*Q) Does this ensure that all the new changes are reflected in the jar?
(Am I ready to run the sort?)*

4) As I mentioned before, I am working with a sequential binary file format
in which a (key, value) data structure repeats. So I wrote a C program that
generates random values for my data structure and populates a file by
writing the (key, value) records sequentially in binary. I gave this file as
the input to the sort, which should sort my (key, value) records by key.
I got the error: fp_input not a SequenceFile (fp_input is my input file).
I thought SequenceFiles were just a stream of binary data. Do they have a
specific format?
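
For what it is worth, the Hadoop SequenceFile documentation describes a header
that begins with the three magic bytes "SEQ" followed by a version byte, so a
raw file of bare 40-byte records would not pass as one. A quick way to inspect
a file is sketched below in plain Java. The class name SeqMagicCheck is
hypothetical, the sketch assumes a local copy of the file, and it only looks at
the magic bytes rather than validating the rest of the header.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical check: does a file start with the SequenceFile magic "SEQ"?
final class SeqMagicCheck {

    // True if the given leading bytes begin with 'S', 'E', 'Q'.
    static boolean looksLikeSequenceFile(byte[] firstBytes) {
        return firstBytes.length >= 3
            && firstBytes[0] == 'S'
            && firstBytes[1] == 'E'
            && firstBytes[2] == 'Q';
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SeqMagicCheck <local-file>");
            return;
        }
        byte[] head = new byte[3];
        int read;
        try (InputStream in = new FileInputStream(args[0])) {
            read = in.read(head);
        }
        // Keep only the bytes that were actually read (the file may be tiny).
        byte[] seen = Arrays.copyOf(head, Math.max(read, 0));
        System.out.println(looksLikeSequenceFile(seen)
            ? "starts with SEQ magic"
            : "no SEQ magic: not a SequenceFile");
    }
}
```

Running it against fp_input would show whether the file written by the C
program carries that header at all.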

*Command used: bin/hadoop jar hadoop-0.20.2-examples.jar sort fp_input
fp_output*

*Q) What does this imply? I have no clue how to proceed further. Again, is
it because the jar file I executed doesn't have the latest libraries? I
could not find any good tutorials on this.*

It would be great if someone could offer a helping hand to this noob.

Thanks,
Matthew John