Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2008/04/26 18:06:55 UTC

[jira] Updated: (HADOOP-3315) New binary file format

     [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-3315:
----------------------------------

    Description: SequenceFile's block compression format is too complex and requires 4 codecs to compress or decompress. It would be good to have a file format that only needs   (was: SequenceFile's block compression format is too complex and requires 4 codecs to compress or decompress. I would propose that we move to:

{code}
block 1
block 2
...
index
tail
{code}

where each block is compressed and contains:

{code}
key/value 1: key len (vint), key, value len (vint), value
key/value 2
...
{code}

The index would be compressed and contain:

{code}
block 1: offset, first record idx
block 2: offset, first record idx
block 3: offset, first record idx
...
{code}

and the tail would look like:

{code}
key class name
value class name
index kind (none, keys, keys+bloom filter)
format version
offset of tail
offset of index
{code}

Extensions of this format would then put additional indexes between the last block and the start of the index. For example, an index of the first key of each block:

{code}
first key of block 1: key len (vint), key
first key of block 2
...
offset of start key index
{code}

Another reasonable extension of the key index would be a bloom filter of the keys:

{code}
bloom filter serialization
offset of bloom filter index start
{code}

)

I would propose that we move to:

{code}
block 1
block 2
...
index
tail
{code}

where each block is compressed and contains:

{code}
key/value 1: key len (vint), key, value len (vint), value
key/value 2
...
{code}
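
As a rough illustration of the block layout above, here is a minimal Python sketch that packs key/value pairs with length prefixes and compresses the result. The base-128 varint used for "vint" and the choice of zlib are assumptions made for the example; they are not Hadoop's WritableUtils encoding or any codec named in this issue, and the function names are hypothetical.

```python
import io
import zlib


def write_vint(out, n):
    """Write a non-negative int as a base-128 varint (assumed encoding)."""
    if n < 0:
        raise ValueError("lengths must be non-negative")
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.write(bytes([byte | 0x80]))  # more bytes follow
        else:
            out.write(bytes([byte]))
            return


def read_vint(inp):
    """Read a varint written by write_vint."""
    result = 0
    shift = 0
    while True:
        b = inp.read(1)
        if not b:
            raise EOFError("truncated vint")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return result
        shift += 7


def encode_block(records):
    """Serialize (key, value) byte pairs, then compress the whole block."""
    buf = io.BytesIO()
    for key, value in records:
        write_vint(buf, len(key))
        buf.write(key)
        write_vint(buf, len(value))
        buf.write(value)
    return zlib.compress(buf.getvalue())


def decode_block(data):
    """Decompress a block and return its (key, value) pairs in order."""
    raw = zlib.decompress(data)
    inp = io.BytesIO(raw)
    records = []
    while inp.tell() < len(raw):
        key = inp.read(read_vint(inp))
        value = inp.read(read_vint(inp))
        records.append((key, value))
    return records
```

Because the block is compressed as a single unit, a reader only needs one codec instance per block, which is the simplification the proposal is after.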

The index would be compressed and contain:

{code}
block 1: offset, first record idx
block 2: offset, first record idx
block 3: offset, first record idx
...
{code}
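
One way to read the index above: each entry gives the block's byte offset and the number of the first record it holds, so a reader can binary-search for the block containing a given record number. The sketch below assumes the index has already been decompressed into a list of (offset, first record idx) tuples; build_index and find_block are hypothetical names.

```python
import bisect


def build_index(blocks):
    """Build index entries from a list of (compressed_bytes, record_count)."""
    index = []
    offset = 0
    first_record = 0
    for data, count in blocks:
        index.append((offset, first_record))
        offset += len(data)
        first_record += count
    return index


def find_block(index, record_idx):
    """Return (block number, byte offset) of the block holding record_idx."""
    if record_idx < 0 or not index:
        raise IndexError("no block for record %d" % record_idx)
    firsts = [first for _, first in index]
    # Last block whose first record index is <= record_idx.
    pos = bisect.bisect_right(firsts, record_idx) - 1
    return pos, index[pos][0]
```

Note that the index as described does not record the total number of records, so in this sketch any record number past the end of the file maps to the last block; the reader would find out only after scanning that block.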

and the tail would look like:

{code}
key class name
value class name
index kind (none, keys, keys+bloom filter)
format version
offset of tail
offset of index
{code}
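
To make the role of the two offsets concrete, here is a sketch of writing the tail and finding it again from the end of the file. The field widths are assumptions chosen for the example (2-byte length-prefixed UTF-8 strings, a 1-byte index kind, a 4-byte version, two 8-byte offsets); the proposal does not specify them. With fixed-width offsets at the very end, a reader can seek to the last 16 bytes, learn where the tail and index start, and parse the rest.

```python
import io
import struct

# Index kinds in the order listed in the tail description.
INDEX_KINDS = ("none", "keys", "keys+bloom filter")


def _write_str(out, s):
    data = s.encode("utf-8")
    out.write(struct.pack(">H", len(data)))
    out.write(data)


def _read_str(inp):
    (n,) = struct.unpack(">H", inp.read(2))
    return inp.read(n).decode("utf-8")


def write_tail(out, key_class, value_class, index_kind, version, index_offset):
    """Append the tail at the current position of a seekable binary stream."""
    tail_offset = out.tell()
    _write_str(out, key_class)
    _write_str(out, value_class)
    out.write(struct.pack(">B", INDEX_KINDS.index(index_kind)))
    out.write(struct.pack(">I", version))
    out.write(struct.pack(">QQ", tail_offset, index_offset))


def read_tail(inp):
    """Locate the tail via the last 16 bytes of the file and parse it."""
    inp.seek(-16, io.SEEK_END)
    tail_offset, index_offset = struct.unpack(">QQ", inp.read(16))
    inp.seek(tail_offset)
    key_class = _read_str(inp)
    value_class = _read_str(inp)
    (kind,) = struct.unpack(">B", inp.read(1))
    (version,) = struct.unpack(">I", inp.read(4))
    return {
        "key_class": key_class,
        "value_class": value_class,
        "index_kind": INDEX_KINDS[kind],
        "version": version,
        "tail_offset": tail_offset,
        "index_offset": index_offset,
    }
```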

Extensions of this format would then put additional indexes between the last block and the start of the index. For example, an index of the first key of each block:

{code}
first key of block 1: key len (vint), key
first key of block 2
...
offset of start key index
{code}
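
A first-key index like this only helps lookups if the keys in the file are sorted, which the description does not state; the sketch below assumes sorted keys compared as raw bytes. Under that assumption, the candidate block for a key is the last one whose first key is less than or equal to it. The function name is hypothetical.

```python
import bisect


def find_block_for_key(first_keys, key):
    """Return the number of the only block that could contain key, or None.

    first_keys holds the first key of each block, in block order.
    Assumes keys are sorted across the file and compare as raw bytes.
    """
    pos = bisect.bisect_right(first_keys, key) - 1
    return pos if pos >= 0 else None
```

For example, with first keys apple, kiwi and pear, a lookup for banana goes to block 0, and a key smaller than apple cannot be in the file at all.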

Another reasonable extension of the key index would be a bloom filter of the keys:

{code}
bloom filter serialization
offset of bloom filter index start
{code}
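
For readers unfamiliar with the idea, a Bloom filter lets a reader rule out a key without decompressing any block: a negative answer is definite, a positive answer may be a false positive. The sketch below is a generic illustration, not a proposed serialization; the SHA-256 based double hashing, the header of two 4-byte integers, and the class and method names are all assumptions.

```python
import hashlib
import struct


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        """False means the key is definitely absent; True means maybe present."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    def serialize(self):
        return struct.pack(">II", self.num_bits, self.num_hashes) + bytes(self.bits)

    @classmethod
    def deserialize(cls, data):
        num_bits, num_hashes = struct.unpack(">II", data[:8])
        bf = cls(num_bits, num_hashes)
        bf.bits = bytearray(data[8:])
        return bf
```

A writer would add every key as it is appended and store the serialized filter after the last block; a reader would deserialize it once and call might_contain before touching the key index.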



> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress or decompress. It would be good to have a file format that only needs 
