Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2006/08/16 12:12:48 UTC

[Lucene-hadoop Wiki] Update of "SequenceFile" by Arun C Murthy

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/lucene-hadoop/SequenceFile

The comment on the change is:
First Cut

New page:
== Overview ==

SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format.
It is also worth noting that the ''output'' of the Map is always a SequenceFile.

SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively.

There are 3 different !SequenceFile formats:
 1. Uncompressed key/value records.
 2. Record compressed key/value records - only 'values' are compressed here.
 3. Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed.

The recommended way is to use the SequenceFile.createWriter methods to construct the 'preferred' writer implementation.

The [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Reader.html SequenceFile.Reader] acts as a bridge and can read any of the above SequenceFile formats.

== SequenceFile Formats ==

Essentially there are 3 different file formats for !SequenceFiles depending on whether ''compression'' and ''block compression'' are active.


However, all of these formats share a common ''header'' (which the !SequenceFile.Reader uses to return the appropriate key/value pairs). The next section summarises the header:
[[Anchor(SeqFileHeader)]]
===== SequenceFile Common Header =====
 * version - A byte array: the 3 bytes 'SEQ' followed by 1 byte holding the version no.
 * keyClassName - String
 * valueClassName - String
 * compression - A boolean which specifies if ''compression'' is turned on for keys/values in this file.
 * blockCompression - A boolean which specifies if ''block compression'' is turned on for keys/values in this file.
 * sync - A sync marker to denote end of the header.
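
As a rough illustration of the header fields listed above, the following sketch writes and re-reads them with plain java.io streams. It is not the Hadoop implementation: the class name `HeaderSketch`, the version number 3, the use of `writeUTF` for the class names and the 16-byte sync marker length are all assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration of the common header; not Hadoop's own code.
class HeaderSketch {
    static final int SYNC_SIZE = 16; // assumed sync marker length
    static final byte VERSION = 3;   // assumed version number

    String keyClassName;
    String valueClassName;
    boolean compression;
    boolean blockCompression;
    byte[] sync;

    // Serialises the header fields in the order given in the list above.
    static byte[] write(String keyClass, String valueClass,
                        boolean compression, boolean blockCompression, byte[] sync) {
        if (sync.length != SYNC_SIZE) {
            throw new IllegalArgumentException("sync marker must be " + SYNC_SIZE + " bytes");
        }
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.write(new byte[] {'S', 'E', 'Q', VERSION}); // version
            out.writeUTF(keyClass);                         // keyClassName
            out.writeUTF(valueClass);                       // valueClassName
            out.writeBoolean(compression);                  // compression
            out.writeBoolean(blockCompression);             // blockCompression
            out.write(sync);                                // sync marker ends the header
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses a header produced by write().
    static HeaderSketch read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte[] magic = new byte[4];
            in.readFully(magic);
            if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
                throw new IllegalArgumentException("missing SEQ magic bytes");
            }
            HeaderSketch h = new HeaderSketch();
            h.keyClassName = in.readUTF();
            h.valueClassName = in.readUTF();
            h.compression = in.readBoolean();
            h.blockCompression = in.readBoolean();
            h.sync = new byte[SYNC_SIZE];
            in.readFully(h.sync);
            return h;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps the two flags onto the three formats named in the overview.
    String format() {
        if (blockCompression) return "block compressed";
        return compression ? "record compressed" : "uncompressed";
    }
}
```

A reader built along these lines only needs the two booleans to decide how to decode the records that follow, which is why one Reader can serve all three formats.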


The formats for Uncompressed/!RecordCompressed Writers are very similar:
===== Uncompressed/RecordCompressed Writer Format =====
 * [#SeqFileHeader Header]
 * Record
   * Key
   * (Compressed?) Value
 * A sync-marker every 100 bytes or so, to help in seeking to a random point in the file and then skipping ahead to the next ''record''.
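
The role of the sync-marker can be sketched as a forward scan: after jumping to an arbitrary offset, a reader looks for the next occurrence of the marker bytes and resumes just past it. The example below works on an in-memory byte array and assumes the marker appears verbatim in the data; the class and method names are hypothetical and the real reader may locate markers differently.

```java
// Hypothetical sketch of resynchronising on a sync marker; not Hadoop's own code.
class SyncScanSketch {
    // Returns the offset just past the first sync marker found at or after
    // 'from', or -1 if no marker occurs in the rest of the data.
    static int nextRecordStart(byte[] file, byte[] sync, int from) {
        for (int start = Math.max(from, 0); start + sync.length <= file.length; start++) {
            boolean match = true;
            for (int i = 0; i < sync.length; i++) {
                if (file[start + i] != sync[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start + sync.length; // the next record begins here
            }
        }
        return -1;
    }
}
```

Because the marker recurs regularly, a scan started at any offset has to read only a short stretch of the file before it finds a record boundary.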

The format for the !BlockCompressedWriter is as follows:
===== BlockCompressed Writer Format =====
 * [#SeqFileHeader Header]
 * Record ''Block''
   * !CompressedKeyLengthsBlockSize
   * !CompressedKeyLengthsBlock
   * !CompressedKeysBlockSize
   * !CompressedKeysBlock
   * !CompressedValueLengthsBlockSize
   * !CompressedValueLengthsBlock
   * !CompressedValuesBlockSize
   * !CompressedValuesBlock
   * A sync-marker to help in seeking to a random point in the file and then skipping ahead to the next ''record block''.
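
The block layout above can be sketched with java.util.zip: key lengths, keys, value lengths and values are gathered in four separate buffers, each buffer is compressed on its own and written with its compressed size in front, and a sync-marker closes the block. This is only an illustration under stated assumptions: the class `BlockSketch`, the 4-byte `int` encoding of sizes and lengths, the use of Deflater as the codec and the String keys/values are choices made for the example, not details taken from Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch of one record block; not Hadoop's own code.
class BlockSketch {
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Writes: size + key lengths, size + keys, size + value lengths, size + values, sync.
    static byte[] writeBlock(List<String> keys, List<String> values, byte[] sync) {
        if (keys.size() != values.size()) {
            throw new IllegalArgumentException("need one value per key");
        }
        try {
            ByteArrayOutputStream keyLens = new ByteArrayOutputStream();
            ByteArrayOutputStream keyBytes = new ByteArrayOutputStream();
            ByteArrayOutputStream valLens = new ByteArrayOutputStream();
            ByteArrayOutputStream valBytes = new ByteArrayOutputStream();
            DataOutputStream keyLenOut = new DataOutputStream(keyLens);
            DataOutputStream valLenOut = new DataOutputStream(valLens);
            for (int i = 0; i < keys.size(); i++) {
                byte[] k = keys.get(i).getBytes(StandardCharsets.UTF_8);
                byte[] v = values.get(i).getBytes(StandardCharsets.UTF_8);
                keyLenOut.writeInt(k.length);
                keyBytes.write(k);
                valLenOut.writeInt(v.length);
                valBytes.write(v);
            }
            ByteArrayOutputStream block = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(block);
            byte[][] sections = {keyLens.toByteArray(), keyBytes.toByteArray(),
                                 valLens.toByteArray(), valBytes.toByteArray()};
            for (byte[] section : sections) {
                byte[] packed = compress(section);
                out.writeInt(packed.length); // compressed section size
                out.write(packed);           // compressed section
            }
            out.write(sync);                 // sync marker closes the block
            return block.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the decompressed section at index 0..3 of a block from writeBlock().
    static byte[] readSection(byte[] block, int index) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(block));
            byte[] packed = new byte[0];
            for (int i = 0; i <= index; i++) {
                packed = new byte[in.readInt()];
                in.readFully(packed);
            }
            return decompress(packed);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping keys and values in separate buffers means similar data is compressed together, and a reader that only needs the keys can skip the value sections by their stored sizes.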