Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2008/04/25 23:41:55 UTC

[jira] Commented: (HADOOP-3315) New binary file format

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592515#action_12592515 ] 

Doug Cutting commented on HADOOP-3315:
--------------------------------------

[Meta comment: I wish folks would just describe problems in an issue's description, and leave solutions to the comments.  Descriptions are appended to every email message.  Also, solutions change as a result of discussion, while the problem should not.]

Is this a format just for compressed sequence files, or for all sequence files?

Is this intended as a replacement for MapFile too?

I think some kind of magic-number header at the start of files is good to have.  That would also permit back-compatibility with SequenceFile in this case, since a reader could check the leading bytes to tell the two formats apart.
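
To illustrate the back-compatibility point, here is a minimal sketch of a reader that dispatches on the leading bytes.  SequenceFile headers begin with the bytes "SEQ"; the magic value NEW_FORMAT_MAGIC and the function name detect_format are hypothetical and exist only for this example, since the proposal does not define a header.

```python
import io

SEQUENCE_FILE_MAGIC = b"SEQ"   # existing SequenceFile headers start with these bytes
NEW_FORMAT_MAGIC = b"NBF1"     # hypothetical magic number, not part of the proposal


def detect_format(stream):
    """Return 'sequencefile', 'new', or 'unknown' from the leading bytes."""
    start = stream.tell()
    head = stream.read(4)
    stream.seek(start)  # rewind so the chosen reader sees the whole header
    if head.startswith(SEQUENCE_FILE_MAGIC):
        return "sequencefile"
    if head == NEW_FORMAT_MAGIC:
        return "new"
    return "unknown"


old_file = io.BytesIO(b"SEQ\x06...")
new_file = io.BytesIO(b"NBF1...")
print(detect_format(old_file), detect_format(new_file))
```

A reader built this way could hand old files to the existing SequenceFile code path and new files to the new one.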

In the index, what is "first record idx" -- is that the key or the ordinal position of the first entry?
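
The two readings lead to different lookups, which is why the distinction matters.  The sketch below is only an illustration: the offsets, ordinals, and keys are made-up sample data, and the function names are hypothetical.

```python
import bisect

# Sample index with three blocks (all values invented for illustration).
offsets = [0, 4096, 8192]

# Reading A: "first record idx" is the ordinal of the block's first entry.
first_ordinals = [0, 100, 250]

# Reading B: "first record idx" is the first key stored in the block.
first_keys = [b"apple", b"mango", b"peach"]


def block_offset_for_record(n):
    """Reading A: find the block holding the n-th record (0-based)."""
    i = bisect.bisect_right(first_ordinals, n) - 1
    return offsets[i] if i >= 0 else None


def block_offset_for_key(key):
    """Reading B: find the only block that could contain the key."""
    i = bisect.bisect_right(first_keys, key) - 1
    return offsets[i] if i >= 0 else None


print(block_offset_for_record(100))        # block 2 starts at record 100
print(block_offset_for_key(b"nectarine"))  # sorts between "mango" and "peach"
```

With ordinals the index supports seeking to the n-th record and splitting by record count; with keys it supports MapFile-style lookup, but only if the keys are sorted.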




> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress or decompress. I would propose that we move to:
> {code}
> block 1
> block 2
> ...
> index
> tail
> {code}
> where each block is compressed and contains:
> {code}
> key/value 1: key len (vint), key, value len (vint), value
> key/value 2
> ...
> {code}
> The index would be compressed and contain:
> {code}
> block 1: offset, first record idx
> block 2: offset, first record idx
> block 3: offset, first record idx
> ...
> {code}
> and the tail would look like:
> {code}
> key class name
> value class name
> index kind (none, keys, keys+bloom filter)
> format version
> offset of tail
> offset of index
> {code}
> Then extensions of this format would put more indexes between the last block and the start of the index. So for example, the first key of each block:
> {code}
> first key of block 1: key len (vint), key
> first key of block 2
> ...
> offset of start key index
> {code}
> Another reasonable extension of the key index would be a bloom filter of the keys:
> {code}
> bloom filter serialization
> offset of bloom filter index start
> {code}
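
For reference while discussing the description above, the per-block record layout (key len, key, value len, value) could be sketched roughly as follows.  This is an assumption-laden illustration, not the proposed implementation: the variable-length integer here is a simple 7-bits-per-byte unsigned encoding rather than Hadoop's own vint encoding, zlib stands in for whatever codec is configured, and all function names are hypothetical.

```python
import zlib


def write_vint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first (illustrative)."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low)
            return bytes(out)


def read_vint(buf, pos):
    """Decode one integer starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def encode_block(records):
    """Serialize (key, value) byte pairs, then compress the whole block once."""
    raw = bytearray()
    for key, value in records:
        raw += write_vint(len(key)) + key + write_vint(len(value)) + value
    return zlib.compress(bytes(raw))


def decode_block(data):
    """Decompress a block and split it back into (key, value) pairs."""
    raw = zlib.decompress(data)
    pos, records = 0, []
    while pos < len(raw):
        klen, pos = read_vint(raw, pos)
        key, pos = raw[pos:pos + klen], pos + klen
        vlen, pos = read_vint(raw, pos)
        value, pos = raw[pos:pos + vlen], pos + vlen
        records.append((key, value))
    return records


block = encode_block([(b"k1", b"v1"), (b"k2", b"x" * 300)])
print(decode_block(block)[0])
```

Because each block is compressed as a unit, only one compressor and one decompressor are needed per file, which is the simplification the description is after.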

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.