Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2006/08/16 12:12:48 UTC
[Lucene-hadoop Wiki] Update of "SequenceFile" by Arun C Murthy
The following page has been changed by Arun C Murthy:
http://wiki.apache.org/lucene-hadoop/SequenceFile
The comment on the change is:
First Cut
New page:
== Overview ==
SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format.
It is also worth noting that the ''output'' of the Map is always a SequenceFile.
SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting, respectively.
There are 3 different !SequenceFile formats:
1. Uncompressed key/value records.
2. Record compressed key/value records - only 'values' are compressed here.
3. Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed.
The recommended approach is to use the SequenceFile.createWriter methods, which construct the 'preferred' writer implementation.
The [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Reader.html SequenceFile.Reader] acts as a bridge and can read any of the above SequenceFile formats.
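To make the three formats concrete, here is a minimal Python sketch. It is not Hadoop code: the function names and the use of zlib are assumptions made purely for illustration, and the real byte layout is decided by the !SequenceFile writers. It only shows which parts of the data are compressed in each mode.

```python
import zlib

def uncompressed(pairs):
    # Format 1: keys and values are stored as-is.
    return [(k, v) for k, v in pairs]

def record_compressed(pairs):
    # Format 2: each value is compressed on its own; keys stay raw.
    return [(k, zlib.compress(v)) for k, v in pairs]

def block_compressed(pairs):
    # Format 3: keys and values are collected separately,
    # and each collection is compressed as a whole.
    keys = b"".join(k for k, _ in pairs)
    values = b"".join(v for _, v in pairs)
    return zlib.compress(keys), zlib.compress(values)

# Hypothetical sample data: 50 small records with repetitive values.
pairs = [(b"key-%d" % i, b"value " * 20) for i in range(50)]

plain = uncompressed(pairs)
per_record = record_compressed(pairs)
keys_block, values_block = block_compressed(pairs)
```

With repetitive data like this, compressing all values together as one block typically yields a much smaller result than compressing each value separately, which is the motivation for the block format.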
== SequenceFile Formats ==
Essentially there are 3 different file formats for !SequenceFiles, depending on whether ''compression'' and ''block compression'' are active.
However, all of the above formats share a common ''header'', which the !SequenceFile.Reader uses to return the appropriate key/value pairs. The next section summarises the header:
[[Anchor(SeqFileHeader)]]===== SequenceFile Common Header =====
* version - A byte array: the bytes 'S', 'E', 'Q' followed by a version number byte.
* keyClassName - String
* valueClassName - String
* compression - A boolean which specifies if ''compression'' is turned on for keys/values in this file.
* blockCompression - A boolean which specifies if ''block compression'' is turned on for keys/values in this file.
* sync - A sync marker to denote end of the header.
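The header fields above can be sketched as a small Python writer and reader. This is an illustration under stated assumptions, not the actual Hadoop serialization: the string encoding (2-byte length prefix plus UTF-8), the one-byte booleans, the 16-byte sync marker, the version number 4 and the class names used in the example are all assumed for the demo.

```python
import io
import os
import struct

def write_string(out, s):
    # Assumed encoding: 2-byte big-endian length, then UTF-8 bytes.
    data = s.encode("utf-8")
    out.write(struct.pack(">H", len(data)))
    out.write(data)

def read_string(inp):
    (n,) = struct.unpack(">H", inp.read(2))
    return inp.read(n).decode("utf-8")

def write_header(out, version, key_class, value_class,
                 compression, block_compression, sync):
    out.write(b"SEQ" + bytes([version]))             # version
    write_string(out, key_class)                     # keyClassName
    write_string(out, value_class)                   # valueClassName
    out.write(struct.pack(">?", compression))        # compression
    out.write(struct.pack(">?", block_compression))  # blockCompression
    out.write(sync)                                  # sync marker ends the header

def read_header(inp, sync_size=16):
    magic = inp.read(4)
    if magic[:3] != b"SEQ":
        raise ValueError("not a SequenceFile-style header")
    header = {"version": magic[3]}
    header["keyClassName"] = read_string(inp)
    header["valueClassName"] = read_string(inp)
    (header["compression"],) = struct.unpack(">?", inp.read(1))
    (header["blockCompression"],) = struct.unpack(">?", inp.read(1))
    header["sync"] = inp.read(sync_size)
    return header

# Round trip with example values (all hypothetical).
sync = os.urandom(16)
buf = io.BytesIO()
write_header(buf, 4, "org.apache.hadoop.io.Text",
             "org.apache.hadoop.io.IntWritable", True, False, sync)
header = read_header(io.BytesIO(buf.getvalue()))
```

A reader built this way can decide how to decode the rest of the file from the two booleans alone, which is how a single reader can serve all three formats.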
The formats for Uncompressed/!RecordCompressed Writers are very similar:
===== Uncompressed/RecordCompressed Writer Format =====
* [#SeqFileHeader Header]
* Record
  * Key
  * (Compressed?) Value
* A sync-marker every 100 bytes or so to help in seeking to a random point in the file and then seeking to the next ''record''.
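The role of the sync-marker can be illustrated with a short Python sketch. It is a simplified model, not the Hadoop implementation: the fixed 16-byte marker, the 100-byte interval and the record contents are assumptions chosen for the demo. The idea is that a reader dropped at an arbitrary byte offset scans forward to the next marker and resumes at the record that follows it.

```python
# Hypothetical fixed 16-byte marker; its byte values never occur in the
# ASCII records used below, so the demo is deterministic.
SYNC = bytes(range(240, 256))

def build_stream(records, sync=SYNC, interval=100):
    # Concatenate records, emitting a sync marker whenever at least
    # `interval` bytes have been written since the last marker.
    out = bytearray()
    last_sync = 0
    for rec in records:
        if len(out) - last_sync >= interval:
            out += sync
            last_sync = len(out)
        out += rec
    return bytes(out)

def next_record_start(data, position, sync=SYNC):
    # From an arbitrary offset, scan forward to the next sync marker.
    # The next record begins immediately after it.
    idx = data.find(sync, position)
    if idx == -1:
        return None  # no further marker: nothing to resynchronise on
    return idx + len(sync)

records = [b"record-%04d;" % i for i in range(100)]  # 12 bytes each
data = build_stream(records)

# Jump into the middle of a record, then resynchronise.
start = next_record_start(data, 50)
first = data[start:start + 12]
```

This is what makes it possible to split a large file at arbitrary byte offsets and still begin reading at a record boundary.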
The format for the !BlockCompressedWriter is as follows:
===== BlockCompressed Writer Format =====
* [#SeqFileHeader Header]
* Record ''Block''
  * !CompressedKeyLengthsBlockSize
  * !CompressedKeyLengthsBlock
  * !CompressedKeysBlockSize
  * !CompressedKeysBlock
  * !CompressedValueLengthsBlockSize
  * !CompressedValueLengthsBlock
  * !CompressedValuesBlockSize
  * !CompressedValuesBlock
* A sync-marker to help in seeking to a random point in the file and then seeking to the next ''record block''.
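The record block layout above can be sketched in Python as four size-prefixed compressed buffers: key lengths, keys, value lengths and values. This is an illustration only. The 4-byte big-endian size fields, the fixed-width length encoding and the use of zlib are assumptions, and the helper names are hypothetical; the actual writer may encode sizes and lengths differently.

```python
import io
import struct
import zlib

def write_buffer(out, raw):
    # Assumed framing: 4-byte size of the compressed data, then the data.
    comp = zlib.compress(raw)
    out.write(struct.pack(">I", len(comp)))
    out.write(comp)

def read_buffer(inp):
    (size,) = struct.unpack(">I", inp.read(4))
    return zlib.decompress(inp.read(size))

def pack_lengths(items):
    # One 4-byte length per item (assumed fixed-width encoding).
    return b"".join(struct.pack(">I", len(x)) for x in items)

def split_by_lengths(lengths_raw, data):
    # Cut the concatenated data back into items using the lengths.
    count = len(lengths_raw) // 4
    lengths = struct.unpack(">%dI" % count, lengths_raw)
    items, pos = [], 0
    for n in lengths:
        items.append(data[pos:pos + n])
        pos += n
    return items

def write_block(out, pairs):
    keys = [k for k, _ in pairs]
    values = [v for _, v in pairs]
    write_buffer(out, pack_lengths(keys))    # key lengths size + block
    write_buffer(out, b"".join(keys))        # keys size + block
    write_buffer(out, pack_lengths(values))  # value lengths size + block
    write_buffer(out, b"".join(values))      # values size + block

def read_block(inp):
    key_lengths = read_buffer(inp)
    keys = split_by_lengths(key_lengths, read_buffer(inp))
    value_lengths = read_buffer(inp)
    values = split_by_lengths(value_lengths, read_buffer(inp))
    return list(zip(keys, values))

block_pairs = [(b"k1", b"v-one"), (b"key2", b""), (b"k3", b"value three")]
block_buf = io.BytesIO()
write_block(block_buf, block_pairs)
decoded = read_block(io.BytesIO(block_buf.getvalue()))
```

Storing the lengths in their own buffers is what allows the keys and values to be concatenated and compressed as a whole while still being separable on read.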