Posted to dev@nutch.apache.org by d_k <ma...@gmail.com> on 2014/01/20 16:02:02 UTC

What is the correct way to serialize a MapWritable to WebPage's metadata?

I'm working on porting NUTCH-1622 to Nutch 2 and the path I took was to add
a MapWritable field to the Outlink class to hold the metadata.

In order to store the metadata in the WebPage so that it is passed along to
the mappers and reducers, I used the metadata field of the WebPage class.

Because the putToMetadata method of WebPage accepts a ByteBuffer, I'm
converting the MapWritable to a ByteBuffer with something along the lines
of:

ByteArrayOutputStream outStream = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(outStream);

MapWritable outlinkMap = new MapWritable();

// ... fill outlinkMap ...

try {
    // Serialize the map into the in-memory stream
    outlinkMap.write(dataOut);
    dataOut.close();
} catch (IOException e) {
    LOG.warn("...");
}

// Wrap the serialized bytes and store them under a metadata key
ByteBuffer byteBuffer = ByteBuffer.wrap(outStream.toByteArray());
page.putToMetadata(new Utf8("outlinks-metadata"), byteBuffer);

I would be happy to get some input on:
1) Is this the correct way to convert the MapWritable to a ByteBuffer to be
stored in the WebPage's metadata?
2) Should the metadata be stored in the metadata field as a ByteBuffer, or
is there a better way to pass along the metadata?
3) Did I waste my time working with MapWritable? Could I have used any Java
collection, as long as the target JVM can deserialize it, considering that
all that is passed is an array of bytes and Outlink is never passed as it
is? Outlinks are passed as a map between url and anchor (utf8, utf8).
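
To make question 3 concrete, here is a minimal sketch of the plain-Java
alternative: a String-to-String map written to a ByteBuffer with
DataOutputStream and read back with DataInputStream. It uses only the JDK.
The class MapBytes and its two methods are hypothetical names made up for
this illustration (they are not part of Nutch, Gora, or Hadoop), and the
sketch assumes that string keys and values are enough for the outlink
metadata.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a Nutch or Hadoop class.
final class MapBytes {
    private MapBytes() {}

    // Layout: entry count, then each key and value as modified UTF-8.
    // Note: writeUTF rejects strings whose encoded form exceeds 65535 bytes.
    static ByteBuffer toByteBuffer(Map<String, String> map) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(map.size());
            for (Map.Entry<String, String> entry : map.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ByteBuffer.wrap(bytes.toByteArray());
    }

    static Map<String, String> fromByteBuffer(ByteBuffer buffer) {
        // Read through a duplicate so the caller's position is untouched,
        // and copy only the remaining bytes instead of trusting array().
        ByteBuffer view = buffer.duplicate();
        byte[] raw = new byte[view.remaining()];
        view.get(raw);

        Map<String, String> map = new LinkedHashMap<>();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(raw))) {
            int size = in.readInt();
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                map.put(key, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }
}
```

With this shape, the writer would call MapBytes.toByteBuffer(map) where the
snippet above wraps outStream.toByteArray(), and the reader would call
MapBytes.fromByteBuffer on the stored value. The trade-off is that the byte
layout becomes a private format that both sides must agree on, whereas
MapWritable already defines one and can carry other Writable value types.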

... my next change was to make the Utf8 allocation static... :-P