Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2008/03/07 20:51:46 UTC

[jira] Commented: (HADOOP-2853) Add Writable for very large lists of key / value pairs

    [ https://issues.apache.org/jira/browse/HADOOP-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576349#action_12576349 ] 

Chris Douglas commented on HADOOP-2853:
---------------------------------------

Hi Andrzej-

I'm not sure I understand your use case. In particular, why would a long list of values need to be processed as a single value? If the deserialization itself were the problem, i.e. each record was too large to fit into memory, then a paging strategy might make sense; but if you're processing a _list_ of records, this approach turns what was formerly granular enough for the framework to manage into an opaque blob. Wouldn't you be better off tagging your keys to effect the groupings you're managing explicitly? Can you describe the problem this patch resolves in more detail?
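
For what it's worth, key tagging along those lines might look like the Java sketch below; TaggedKey and its field names are hypothetical, chosen only for illustration, and are not part of the Hadoop API or the attached patches.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: "tag" carries the small-domain group id and
// "original" keeps the large-domain key, so the framework still sorts and
// groups individual records instead of receiving one enormous value.
public class TaggedKey implements WritableComparable {
  private Text tag = new Text();
  private Text original = new Text();

  public void set(String t, String o) { tag.set(t); original.set(o); }

  public void write(DataOutput out) throws IOException {
    tag.write(out);
    original.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    tag.readFields(in);
    original.readFields(in);
  }

  public int compareTo(Object o) {
    TaggedKey other = (TaggedKey) o;
    int c = tag.compareTo(other.tag);
    return (c != 0) ? c : original.compareTo(other.original);
  }
}

A partitioner keyed on the tag field would then route each group to a single reduce, and the framework would stream the values rather than materializing them as one object.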

> Add Writable for very large lists of key / value pairs
> ------------------------------------------------------
>
>                 Key: HADOOP-2853
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2853
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.17.0
>            Reporter: Andrzej Bialecki 
>             Fix For: 0.17.0
>
>         Attachments: sequenceWritable-v1.patch, sequenceWritable-v2.patch, sequenceWritable-v3.patch, sequenceWritable-v4.patch, sequenceWritable-v5.patch
>
>
> Some map-reduce jobs need to aggregate and process very long lists as a single value. This usually happens when keys from a large domain are mapped into a small domain, and their associated values cannot be aggregated into a few values but need to be preserved as members of a large list. Currently this can be implemented with a MapWritable or ArrayWritable; however, Hadoop needs to deserialize the current key and value completely into memory, which for extremely large values causes frequent OOM exceptions. This also works only with lists of relatively small size (e.g. 1000 records).
> This patch is an implementation of a Writable that can handle arbitrarily long lists. Initially it keeps an internal buffer (which can be (de)serialized in the ordinary way), and if the list size exceeds a certain threshold, it is spilled to an external SequenceFile (hence the name) on a configured FileSystem. The content of this Writable can be iterated, and the data is pulled transparently either from the internal buffer or from the external file.
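
To make the buffer-then-spill idea above concrete, here is a rough Java sketch; the class name SpillingKeyValueList, its fields, and the fixed Text key/value types are assumptions made for illustration and do not reflect the attached patch.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustrative spill-to-disk list of key/value pairs: records stay in an
// in-memory buffer until a threshold is reached, then further records are
// appended to an external SequenceFile on the configured FileSystem.
public class SpillingKeyValueList {
  private final List<Text[]> buffer = new ArrayList<Text[]>();
  private final int threshold;
  private final FileSystem fs;
  private final Configuration conf;
  private final Path spillFile;
  private SequenceFile.Writer writer;  // created lazily on first spill

  public SpillingKeyValueList(FileSystem fs, Configuration conf,
                              Path spillFile, int threshold) {
    this.fs = fs;
    this.conf = conf;
    this.spillFile = spillFile;
    this.threshold = threshold;
  }

  public void add(Text key, Text value) throws IOException {
    if (buffer.size() < threshold) {
      buffer.add(new Text[] { new Text(key), new Text(value) });
      return;
    }
    if (writer == null) {                     // first spill: open the file
      writer = SequenceFile.createWriter(fs, conf, spillFile,
                                         Text.class, Text.class);
    }
    writer.append(key, value);                // overflow goes to disk
  }
}

An iterator over such a structure would walk the in-memory buffer first and then read any spilled records back with SequenceFile.Reader, which is what lets callers consume the list without holding it all in memory.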

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.