Posted to user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2013/08/09 19:01:21 UTC

Not able to understand writing custom writable

Hi,
   I am trying to understand, how to write my own writable.
So basically trying to understand how to process records spanning multiple
lines.

Can someone break down for me what needs to be considered in each
method?

I am trying to understand this example:
https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java

Can someone explain to me in simple language what each code block is
supposed to do?

My apologies for asking such a vaguely posed question.
Thanks

Re: Not able to understand writing custom writable

Posted by Shahab Yunus <sh...@gmail.com>.
The overarching responsibility of a record reader is to return one record,
which conventionally means one line. But as this case shows, that is not
always true: an XML file can span multiple physical lines that functionally
map to a single record. For such input we have to write our own record
reader to perform the mapping or conversion from multiple physical
lines/records to a single functional record/line.

In the constructor a split is passed in, representing a chunk of contiguous
records from the source XML file. To process this split, which can contain
many physical and functional records, one-time bootstrapping is done:
instance variables are initialized and the total length of the data is
calculated. A reader is also opened; by reader I mean one for reading the
plain XML file with ordinary Java I/O classes (or Hadoop's abstractions
over them). Also notable is that the whole file is opened for reading, but
for this particular split, reading starts from the split's start offset, as
the 'seek' method call shows. Every reader gets the same file and opens it,
but actually starts reading only from the point where its split is supposed
to begin, hence the call to 'seek' as mentioned.
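A toy illustration of that seek-to-split idea, using plain java.io in place
of Hadoop's FSDataInputStream (the class name, helper name, and fixed split
boundaries here are hypothetical, chosen only for the sketch):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SeekDemo {
    // Each reader opens the full "file" but skips to its own split's start
    // offset before reading, mimicking fsin.seek(start) in the constructor.
    static String readSplit(byte[] file, int start, int end) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(file);
        in.skip(start);                      // analogous to fsin.seek(start)
        byte[] buf = new byte[end - start];  // read only this split's bytes
        in.read(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] file = "AAAABBBBCCCC".getBytes(StandardCharsets.UTF_8);
        // The "middle split" covers byte offsets [4, 8).
        System.out.println(readSplit(file, 4, 8));
    }
}
```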

The 'next' method, which is overridden and is called by the framework to
read the next functional record atomically, is regular Java I/O and XML tag
logic; nothing specific to M/R. What you are trying to do is read
everything between a start and an end XML tag that you specify and set in
the constructor. Everything between this start and end tag (which can be
more nested tags or just text) is considered one record. You then simply
use Java I/O classes plus string/byte comparison and manipulation to parse
and construct your record. When the stream is exhausted (i.e.
fsin.getPos() < end no longer holds), you have processed all the data for
this split and you are done.

Given an xml file like this:
<tag1>           // split 1
<tag2>a</tag2>
<tag3>b</tag3>
</tag1>          // byte position of '>' is the key of the 1st record
<tag1>           // split 2
<tag2>c</tag2>
<tag3>d</tag3>
</tag1>          // byte position of '>' is the key of the 2nd record
<tag1>           // split 3
<tag2>e</tag2>
<tag3>f</tag3>
</tag1>          // byte position of '>' is the key of the 3rd record

Let us say you have 3 splits. For the first split your start tag would be
'tag1' and end tag '/tag1' (the pair that encloses the 'a' and 'b'
sub-tags), and after processing this split you will have 1 record. The
first call to the matching method (readUntilMatch) with
'withinBlock=false' just seeks to the end of the start tag (tag1) and
buffers nothing. The next call to the same method, with the second
parameter set to true, begins reading where the start tag ended and
continues until the end tag (</tag1>) or the end of the split (or of the
file, if it was the last split); this time it saves or buffers the data it
encounters, which is exactly what we want: the data between the start and
end tags, which forms our value, our one functional record. There is some
logic for handling corner cases and sanity checks in there as well.
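The two readUntilMatch calls can be sketched like this. To be clear, this
is a simplified stand-in for the Mahout code, not the code itself: it uses
plain java.io streams instead of FSDataInputStream, omits the
split-boundary bookkeeping, and the helper nextRecord is my own
hypothetical wrapper:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class TagScanner {
    // Scan the stream byte by byte for `match`; when withinBlock is true,
    // buffer every byte read along the way (the second, value-building pass).
    static boolean readUntilMatch(DataInputStream in, byte[] match,
                                  boolean withinBlock,
                                  ByteArrayOutputStream buffer) throws IOException {
        int i = 0;
        while (true) {
            int b = in.read();
            if (b == -1) return false;              // end of stream: no match
            if (withinBlock) buffer.write(b);       // buffer the record body
            if (b == match[i]) {
                i++;
                if (i >= match.length) return true; // whole tag matched
            } else {
                i = (b == match[0]) ? 1 : 0;        // restart the partial match
            }
        }
    }

    // One "functional record": seek past the start tag (no buffering), then
    // buffer everything up to and including the end tag.
    static String nextRecord(DataInputStream in, String startTag,
                             String endTag) throws IOException {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        if (!readUntilMatch(in, start, false, null)) return null; // 1st call: seek only
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write(start, 0, start.length);                        // keep the start tag
        if (!readUntilMatch(in, end, true, buf)) return null;     // 2nd call: buffer
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String xml = "<tag1><tag2>a</tag2><tag3>b</tag3></tag1>";
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(nextRecord(in, "<tag1>", "</tag1>"));
    }
}
```

Calling nextRecord repeatedly on the same stream would yield one buffered
record per start/end tag pair, which is the behaviour 'next' exposes to the
framework.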

You will also notice that the createKey and createValue methods are
overridden as well; they are called first by the framework, and then the
call to the 'next' method is made. Notice that we pass 'key' and 'value'
to 'next', and these objects are then actually set to the functional
record that you compute after parsing the multi-line XML chunk.

Something like:
.....
LongWritable key = createKey();
Text value = createValue();
while (next(key, value)) {
  // each iteration fills key/value with one (functional) record
}
.....
Also note that the key of the constructed and returned functional record is
the physical byte position in the file stream. This means that any two
consecutive functional records will not have keys differing by 1; the
difference will most probably (I say 'probably' as I might be off about the
edge/-1 cases here) be the number of bytes read between the start and end
tags. Also, the byte position of the end tag is used, not the start tag's.
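To make that key arithmetic concrete, here is a small sketch (recordKeys is
a hypothetical helper, not part of the reader, and it assumes single-byte
ASCII content so that character index equals byte offset):

```java
import java.util.ArrayList;
import java.util.List;

public class KeyOffsets {
    // Keys are byte offsets, not line numbers: here we compute the offset
    // just past each closing tag, mirroring fsin.getPos() at the moment the
    // key is set. Consecutive keys differ by the record length in bytes.
    static List<Long> recordKeys(String xml, String endTag) {
        List<Long> keys = new ArrayList<>();
        int from = 0;
        while (true) {
            int at = xml.indexOf(endTag, from);
            if (at < 0) break;
            keys.add((long) (at + endTag.length())); // position after '>'
            from = at + endTag.length();
        }
        return keys;
    }

    public static void main(String[] args) {
        // Two 15-byte records back to back: keys land 15 bytes apart.
        System.out.println(recordKeys("<tag1>ab</tag1><tag1>cd</tag1>", "</tag1>"));
    }
}
```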

The getProgress method, also called by the framework, just gives an
estimate, using simple arithmetic, of how much of the split has been
processed.
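That arithmetic amounts to the following (a sketch under my own naming; the
real method reads the current position from the open stream):

```java
public class Progress {
    // Fraction of the split's byte range [start, end) consumed so far,
    // clamped to at most 1.0.
    static float getProgress(long start, long end, long pos) {
        if (end == start) return 0.0f;  // guard against an empty split
        return Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    public static void main(String[] args) {
        System.out.println(getProgress(0, 100, 25));
    }
}
```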

Regards,
Shahab


On Fri, Aug 9, 2013 at 1:01 PM, jamal sasha <ja...@gmail.com> wrote:

> Hi,
>    I am trying to understand, how to write my own writable.
> So basically trying to understand how to process records spanning multiple
> lines.
>
> Can some one break down to me, that what are the things needed to be
> considered in each method??
>
> I am trying to understand this example:
>
> https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
>
> Can someone explain to me in simple language what is each code block
> suppose to do.
>
> My apologies for asking such a "vaguely" posed question?
> Thanks
>
