You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Kylie McCormick <ky...@gmail.com> on 2008/07/15 00:39:42 UTC

Writable readFields and write functions

Hi There!
I'm currently working on code for my own Writable object (called
ServiceWritable) and I've been working off LongWritable for this one. I was
wondering, however, about the following two functions:

public void readFields(java.io.DataInput in)
and
public void write(java.io.DataOutput out)

I have my own RecordReader object to read in the complex type Service, and I
also have my own Writer object to write my complex type ResultSet for
output. In LongWritable, the code is very simple:

value = in.readLong()
and
out.writeLong(value);

Since I am dealing with more complex objects, the ObjectWritable won't help
me. I'm a little confused with the interaction here between my RecordReader,
and Writer objects--because there does not seem to be any directly. Can
someone help me out here?

Thanks,
Kylie

Re: Writable readFields and write functions

Posted by Chris Douglas <ch...@yahoo-inc.com>.
> -- Presently, my RecordReader converts XML strings from a file to  
> MyWritable
> object
> -- When readFields is called, RecordReader should provide the next
> MyWritable object, if there is one
> -- When write is called, MyWriter should write the objects out

Not quite. Your RecordReader may produce MyWritable records, but  
readFields may not be involved. For your MyWritable records to get to  
your reduce, they should implement the Writable interface so the  
framework may regard them as streams of bytes. Your OutputFormat-  
which may use your MyWriter- may take the MyWritable objects you emit  
from your reduce and make them conform to whatever format your spec  
requires.

* Your InputFormat takes XML and provides MyWritable objects to your  
mapper
* The framework calls MyWritable::write(byte_stream) and  
MyWritable::readFields(byte_stream) to push records you emit from your  
mapper across the network, between abstractions, etc.
* Your OuputFormat takes MyWritable objects you emit from your reducer  
and stores them according to the format you specify

With many exceptions, most RecordReaders calling readFields are  
reading from structured, generic formats (like SequenceFile). -C

> The RecordReader is record-oriented, but both the readFields and write
> functions are byte-oriented... in order for Hadoop to be happy, I  
> need to
> coordinate my record-oriented to byte-oriented.
>
> Is this correct? I just want to make sure before I tinker more with  
> the
> code, to have the design properly down.
>
> Thanks!
> Kylie
>
>
> On Mon, Jul 14, 2008 at 3:43 PM, Chris Douglas <ch...@yahoo-inc.com>
> wrote:
>
>> It's easiest to consider write as a function that converts your  
>> record to
>> bytes and readFields as a function restoring your record from  
>> bytes. So it
>> should be the case that:
>>
>> MyWritable i = new MyWritable();
>> i.initWithData(some_data);
>> i.write(byte_stream);
>> ...
>> MyWritable j = new MyWritable();
>> j.initWithData(some_other_data); // (1)
>> j.readFields(byte_stream);
>> assert i.equals(j);
>>
>> Note that the assert should be true whether or not (1) is present,  
>> i.e. a
>> call to readFields should be deterministic and without hysteresis  
>> (it should
>> make no difference whether the Writable is newly created or if it  
>> formally
>> held some other state). readFields must also consume the entire  
>> record, so
>> for example, if write outputs three integers, readFields must  
>> consume three
>> integers. Variable-sized Writables are common, but any optional/ 
>> variably
>> sized fields must be encoded to satisfy the preceding.
>>
>> So if your MyBigWritable record held two ints (integerA, integerB)  
>> and a
>> MyWritable (my_writable), its write method might look like:
>>
>> out.writeInt(integerA);
>> out.writeInt(integerB);
>> my_writable.write(out);
>>
>> and readFields would restore:
>>
>> integerA = in.readInt(in);
>> integerB = in.readInt(in);
>> my_writable.readFields(in);
>>
>> There are many examples in the source of simple, compound, and
>> variably-sized Writables.
>>
>> Your RecordReader is responsible for providing a key and value to  
>> your map.
>> Most generic formats rely on Writables or another mode of  
>> serialization to
>> write and restore objects to/from structured byte sequences, but less
>> generic InputFormats will create Writables from byte streams.
>> TextInputFormat, for example, will create Text objects from CR- 
>> delimited
>> files, though Text objects are not, themselves, encoded in the  
>> file. In
>> constrast, a SequenceFile storing the same data will encode the  
>> Text object
>> (using its write method) and will restore that object as encoded.
>>
>> The critical difference is that the framework needs to convert your  
>> record
>> to a byte stream at various points- hence the Writable interface-  
>> while you
>> may be more particular about the format from which you consume and  
>> the
>> format to which you need your output to conform. Note that you can  
>> elect to
>> use a different serialization framework if you prefer.
>>
>> If your data structure will be used as a key (implementing
>> WritableComparable), it's strongly recommended that you implement a
>> RawComparator, which can compare the serialized bytes directly  
>> without
>> deserializing both arguments. -C
>>
>>
>> On Jul 14, 2008, at 3:39 PM, Kylie McCormick wrote:
>>
>> Hi There!
>>> I'm currently working on code for my own Writable object (called
>>> ServiceWritable) and I've been working off LongWritable for this  
>>> one. I
>>> was
>>> wondering, however, about the following two functions:
>>>
>>> public void readFields(java.io.DataInput in)
>>> and
>>> public void write(java.io.DataOutput out)
>>>
>>> I have my own RecordReader object to read in the complex type  
>>> Service, and
>>> I
>>> also have my own Writer object to write my complex type ResultSet  
>>> for
>>> output. In LongWritable, the code is very simple:
>>>
>>> value = in.readLong()
>>> and
>>> out.writeLong(value);
>>>
>>> Since I am dealing with more complex objects, the ObjectWritable  
>>> won't
>>> help
>>> me. I'm a little confused with the interaction here between my
>>> RecordReader,
>>> and Writer objects--because there does not seem to be any  
>>> directly. Can
>>> someone help me out here?
>>>
>>> Thanks,
>>> Kylie
>>>
>>
>>
>
>
> -- 
> The Circle of the Dragon -- unlock the mystery that is the dragon.
> http://www.blackdrago.com/index.html
>
> "Light, seeking light, doth the light of light beguile!"
> -- William Shakespeare's Love's Labor's Lost


Re: Writable readFields and write functions

Posted by Kylie McCormick <ky...@gmail.com>.
Hello Chris:
Thanks for the prompt reply!

So, to conclude from your note:
-- Presently, my RecordReader converts XML strings from a file to MyWritable
object
-- When readFields is called, RecordReader should provide the next
MyWritable object, if there is one
-- When write is called, MyWriter should write the objects out

The RecordReader is record-oriented, but both the readFields and write
functions are byte-oriented... in order for Hadoop to be happy, I need to
coordinate my record-oriented to byte-oriented.

Is this correct? I just want to make sure before I tinker more with the
code, to have the design properly down.

Thanks!
Kylie


On Mon, Jul 14, 2008 at 3:43 PM, Chris Douglas <ch...@yahoo-inc.com>
wrote:

> It's easiest to consider write as a function that converts your record to
> bytes and readFields as a function restoring your record from bytes. So it
> should be the case that:
>
> MyWritable i = new MyWritable();
> i.initWithData(some_data);
> i.write(byte_stream);
> ...
> MyWritable j = new MyWritable();
> j.initWithData(some_other_data); // (1)
> j.readFields(byte_stream);
> assert i.equals(j);
>
> Note that the assert should be true whether or not (1) is present, i.e. a
> call to readFields should be deterministic and without hysteresis (it should
> make no difference whether the Writable is newly created or if it formally
> held some other state). readFields must also consume the entire record, so
> for example, if write outputs three integers, readFields must consume three
> integers. Variable-sized Writables are common, but any optional/variably
> sized fields must be encoded to satisfy the preceding.
>
> So if your MyBigWritable record held two ints (integerA, integerB) and a
> MyWritable (my_writable), its write method might look like:
>
> out.writeInt(integerA);
> out.writeInt(integerB);
> my_writable.write(out);
>
> and readFields would restore:
>
> integerA = in.readInt(in);
> integerB = in.readInt(in);
> my_writable.readFields(in);
>
> There are many examples in the source of simple, compound, and
> variably-sized Writables.
>
> Your RecordReader is responsible for providing a key and value to your map.
> Most generic formats rely on Writables or another mode of serialization to
> write and restore objects to/from structured byte sequences, but less
> generic InputFormats will create Writables from byte streams.
> TextInputFormat, for example, will create Text objects from CR-delimited
> files, though Text objects are not, themselves, encoded in the file. In
> constrast, a SequenceFile storing the same data will encode the Text object
> (using its write method) and will restore that object as encoded.
>
> The critical difference is that the framework needs to convert your record
> to a byte stream at various points- hence the Writable interface- while you
> may be more particular about the format from which you consume and the
> format to which you need your output to conform. Note that you can elect to
> use a different serialization framework if you prefer.
>
> If your data structure will be used as a key (implementing
> WritableComparable), it's strongly recommended that you implement a
> RawComparator, which can compare the serialized bytes directly without
> deserializing both arguments. -C
>
>
> On Jul 14, 2008, at 3:39 PM, Kylie McCormick wrote:
>
>  Hi There!
>> I'm currently working on code for my own Writable object (called
>> ServiceWritable) and I've been working off LongWritable for this one. I
>> was
>> wondering, however, about the following two functions:
>>
>> public void readFields(java.io.DataInput in)
>> and
>> public void write(java.io.DataOutput out)
>>
>> I have my own RecordReader object to read in the complex type Service, and
>> I
>> also have my own Writer object to write my complex type ResultSet for
>> output. In LongWritable, the code is very simple:
>>
>> value = in.readLong()
>> and
>> out.writeLong(value);
>>
>> Since I am dealing with more complex objects, the ObjectWritable won't
>> help
>> me. I'm a little confused with the interaction here between my
>> RecordReader,
>> and Writer objects--because there does not seem to be any directly. Can
>> someone help me out here?
>>
>> Thanks,
>> Kylie
>>
>
>


-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost

Re: Writable readFields and write functions

Posted by Chris Douglas <ch...@yahoo-inc.com>.
It's easiest to consider write as a function that converts your record  
to bytes and readFields as a function restoring your record from  
bytes. So it should be the case that:

MyWritable i = new MyWritable();
i.initWithData(some_data);
i.write(byte_stream);
...
MyWritable j = new MyWritable();
j.initWithData(some_other_data); // (1)
j.readFields(byte_stream);
assert i.equals(j);

Note that the assert should be true whether or not (1) is present,  
i.e. a call to readFields should be deterministic and without  
hysteresis (it should make no difference whether the Writable is newly  
created or if it formally held some other state). readFields must also  
consume the entire record, so for example, if write outputs three  
integers, readFields must consume three integers. Variable-sized  
Writables are common, but any optional/variably sized fields must be  
encoded to satisfy the preceding.

So if your MyBigWritable record held two ints (integerA, integerB) and  
a MyWritable (my_writable), its write method might look like:

out.writeInt(integerA);
out.writeInt(integerB);
my_writable.write(out);

and readFields would restore:

integerA = in.readInt(in);
integerB = in.readInt(in);
my_writable.readFields(in);

There are many examples in the source of simple, compound, and  
variably-sized Writables.

Your RecordReader is responsible for providing a key and value to your  
map. Most generic formats rely on Writables or another mode of  
serialization to write and restore objects to/from structured byte  
sequences, but less generic InputFormats will create Writables from  
byte streams. TextInputFormat, for example, will create Text objects  
from CR-delimited files, though Text objects are not, themselves,  
encoded in the file. In constrast, a SequenceFile storing the same  
data will encode the Text object (using its write method) and will  
restore that object as encoded.

The critical difference is that the framework needs to convert your  
record to a byte stream at various points- hence the Writable  
interface- while you may be more particular about the format from  
which you consume and the format to which you need your output to  
conform. Note that you can elect to use a different serialization  
framework if you prefer.

If your data structure will be used as a key (implementing  
WritableComparable), it's strongly recommended that you implement a  
RawComparator, which can compare the serialized bytes directly without  
deserializing both arguments. -C

On Jul 14, 2008, at 3:39 PM, Kylie McCormick wrote:

> Hi There!
> I'm currently working on code for my own Writable object (called
> ServiceWritable) and I've been working off LongWritable for this  
> one. I was
> wondering, however, about the following two functions:
>
> public void readFields(java.io.DataInput in)
> and
> public void write(java.io.DataOutput out)
>
> I have my own RecordReader object to read in the complex type  
> Service, and I
> also have my own Writer object to write my complex type ResultSet for
> output. In LongWritable, the code is very simple:
>
> value = in.readLong()
> and
> out.writeLong(value);
>
> Since I am dealing with more complex objects, the ObjectWritable  
> won't help
> me. I'm a little confused with the interaction here between my  
> RecordReader,
> and Writer objects--because there does not seem to be any directly.  
> Can
> someone help me out here?
>
> Thanks,
> Kylie