You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@thrift.apache.org by "Pete Wyckoff (JIRA)" <ji...@apache.org> on 2008/08/19 01:03:44 UTC

[jira] Created: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
-------------------------------------------------------------------------------------------------------------------------------

                 Key: THRIFT-111
                 URL: https://issues.apache.org/jira/browse/THRIFT-111
             Project: Thrift
          Issue Type: New Feature
            Reporter: Pete Wyckoff
            Priority: Minor



Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)

TRecordStream is a Thrift transport that encodes data in a format
suitable for storage in a file (not synchronous communication).

TRecordStream achieves following design goals:
- Be self-describing and extensible.  A file containing a TRecordStream
  must contain enough metadata for an application to read it with no other
  context.  It should be possible to add new features without breaking
  backwards and forwards compatibility.  It should be possible to completely
  change the format without confusing old or programs.
- Be robust against disk corruption.  All data and metadata must (optionally)
  be checksummed.  It must be possible to recover and continue reading
  uncorrupted data after corruption is encountered.
- Be (optionally) human-readable.  TRecordStream will also be used for
  plan-text, line-oriented, human-readable data.  Allowing a plain-text,
  line-oriented, human-readable header format will be advantageous for this
  use case.
- Support asynchronous file I/O.  This feature will not be implemented in the
  first version of TRecordStream, but the implementation must support
  the eventual inclusion of this feature.
- Be performant.  No significant sacrifice of speed should be made in order to
  achieve any of the other design goals.
- Support small synchronous writes



TRecordStream will not do any I/O itself, but will instead focus on
preparing the data format and depend on an underlying transport (TFDTransport,
for example) to write the data to a file.

TRecordStream will have two distinct formats: binary and plain text.

Binary-format streams shall begin with a format version number, encoded as a
32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
first byte of a TRecordStream will always be 0. The version number
shall be repeated once to guard against corruption.  If the two copies of the
version number do not match, the stream must be considered corrupt, and
recovery should proceed as described below (TODO).

Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
after the colon), followed by the decimal form of the version number
(ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
version line shall be repeated.

This document describes version 1 of the format.  Version 1 streams are
composed of series of chunks.  Variable-length chunks are supported, but their
use is discoraged because they make recovering from corrupt chunk headers
difficult.  Each chunk begins with the redundant version identifiers described
above.

Following the version numbers, a binary-format stream shall contain the
following fields, in order and with no padding:
- The (32-bit) CRC-32 of the header length + header data.
- The 32-bit big endian header length.
- A variable-length header, which is a TBinaryProtocol-serialized Thrift
  structure (whose exact structure is defined in
  robust_offline_stream.thrift).

A plain-text stream should follow the versions with:
- The string "Header-Checksum: "
- The eight-character (leading-zero-padded) hexadecimal encoding of the
  unsigned CRC-32 of the header (which does *not* include the CRC-32).
- A linefeed (0x0a).
- A header consisting of zero or more entries, where each entry consists of
  - An entry name, which is an ASCII string consisting of alphanumeric
    characters, dashes ("-"), underscores, and periods (full-stops).
  - A colon followed by a space.
  - An entry value, which is a printable ASCII string not including any
    linefeeds.
  - A linefeed.
- A linefeed.

Header entry names may be repeated.  The handling of repeated names is
dependent on the particular name.  Unless otherwise specified, all entries
with a given name other than the last are ignored.

The actual data will be stored in sub-chunks, which may optionally be
compressed.  (The chunk header will define the compression format used.)  The
chunk header will specify the following fields for each sub-chunk:
 - (optional) Offset within the chunk.  If ommitted, it should be assumed to
   immediately follow the previous sub-chunk.
 - (required) Length of the (optionally) compressed sub-chunk.  This is the
   physical number of bytes in the stream taken up by the sub-cunk.
 - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
   hint.
 - (optional) CRC-32 of the (optionally compressed) sub-chunk.
 - (optional) CRC-32 of the uncompressed sub-chunk.

If no compression format is specified, the sub-chunks should be assumed to be
in "raw" format.

*/


namespace cpp    facebook.thrift.transport.record_stream
namespace java   com.facebook.thrift.transport.recrod_stream
namespace python thrift.transport.recrod_stream


/*
 * enums in plain-text headers should be represented as strings, not numbers.
 * Each enum value should specify the string used in plain text.
 */

enum CompressionType {
  /**
   * "raw": No compression.
   *
   * The data written to the TRecordStream object appears byte-for-byte
   * in the stream.  Raw format streams ignore the uncompressed length and
   * uncompressed checksum of the sub-chunks.  It is strongly advised to use
   * checksums when writing raw sub-chunks.
   */
  COMPRESSION_RAW = 0,

  /**
   * "zlib": zlib compression.
   *
   * The compressed data is a zlib stream compressed with the "deflate"
   * algorithm.  This format is specified by RFCs 1950 and 1951, and is
   * produced by zlib's "compress" or "deflate" functions.  Note that this is
   * *not* a raw "deflate" stream nor a gzip file.
   */
  COMPRESSION_ZLIB = 1,
}

enum RecordType {
  /**
   * (Absent in plain text.) Unspecified record type.
   */
  RECORD_UNKNOWN = 0,

  /**
   * "struct": Thrift structures, serialized back-to-back.
   */
  RECORD_STRUCT = 1,

  /**
   * "call": Thrift method calls, produced by send_method();
   */
  RECORD_CALL = 2,

  /**
   * "lines": Line-oriented text data.
   */
  RECORD_LINES = 3,
}

enum ProtocolType {
  /** (Absent in plain text.) */
  PROTOCOL_UNKNOWN     = 0;
  /** "binary" */
  PROTOCOL_BINARY      = 1;
  /** "dense" */
  PROTOCOL_DENSE       = 2;
  /** "json" */
  PROTOCOL_JSON        = 3;
  /** "simple_json" */
  PROTOCOL_SIMPLE_JSON = 4;
  /** "csv" */
  PROTOCOL_CSV         = 5;
}

/**
 * The structure used to represent metadata about a sub-chunk.
 * In plain text, this structure is included as the value of a "Sub-Chunk"
 * header entry.  Each of these fields should be included, represented
 * according to the comment for ChunkHeader.  Fields should be in order and
 * separated by a single space.  Absent fields should be included as a single
 * dash ("-").
 */
struct SubChunkHeader {
  1: optional i32 offset;
  2: required i32 length;
  3: optional i32 checksum;
  4: optional i32 uncompressed_length;
  5: optional i32 uncompressed_checksum;
}

/**
 * This is the top-level structure encoded as the chunk header.
 * Unless otherwise specified, field will be represented in plain text by
 * uppercasing each word in the field name and replacing underscores with
 * hyphens, producing the field name.  Integers should be ASCII-encoded
 * decimal, except for checksums which should be ASCII-encoded hexadecimal
 * unsigned.
 */
struct ChunkHeader {
  /**
   * Number of bytes per chunk.
   * Recommended to be a power of 2.
   */
  1: required i32 chunk_size;

  /**
   * Type of compression used for sub-chunks.
   * Assumed to be RAW if absent.
   */
  3: optional CompressionType compression_type = COMPRESSION_RAW;

  /**
   * Type of records encoded in the sub-chunks.
   * This information is made accessible to applications,
   * but is otherwise uninterpreted by the transport.
   */
  4: optional RecordType record_type = RECORD_UNKNOWN;

  /**
   * Protocol used for serializing records.
   * This information is made accessible to applications,
   * but is otherwise uninterpreted by the transport.
   */
  5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;

  /**
   * The metadata for the individual sub-chunks,
   * in the order they should be read.
   *
   * In the plain-text format, each of these is written as a separate
   * "Sub-Chunk" header entry, in order.
   */
  2: required list<SubChunkHeader> sub_chunk_headers;
}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623755#action_12623755 ] 

Pete Wyckoff commented on THRIFT-111:
-------------------------------------

Note also that an implementation goal is to not spawn threads to do writes to the underlying transport (like TFileTransport does now).


> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> */
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "David Reiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624964#action_12624964 ] 

David Reiss commented on THRIFT-111:
------------------------------------

My thought would be that the TRS would buffer internally and use the zlib transport to fill a sub block, then re-initialize the zlib transport.  So the inner transport would not have to be aware of sub-blocks at all.  I think that all framing and sub-block management should be handled by the TRS so no inner transports need to be aware of it.

> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649205#action_12649205 ] 

Pete Wyckoff commented on THRIFT-111:
-------------------------------------

I just realized the reversed serialized header at the end of the frame does not work well for the plain text mode.  But since those headers aren't meant to be human usable anyway, as long as they can be grepped out still, it may not be an issue.


> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624452#action_12624452 ] 

Pete Wyckoff commented on THRIFT-111:
-------------------------------------


RE: sub chunk headers
If we ax the requirement of keeping the sub-block headers with the block header (note I'm changing terminology), then buffering is not needed and things become a lot simpler. The main reason for this requirement was to keep data loss minimal in the face of a CRC check error for the overall block. But, if we keep blocks relatively small - e.g., 16 MBs, since data errors are pretty infrequent, I don't think this will be an issue.

The advantage of this approach is we can decouple the sub-block writer from the block writer and the following clean, simple design is possible.



 
{code:title=TFixedFrameTransport.h|borderStyle=solid}
class TFixedFrameHeader {
  public TFixedFrameHeader(string subBlockTransport, size_t blockSize, ....);
  public size_t writeHeader(size_t offsetToFirstNewSubBlock, TTransport otrans) ; 
}

class TFixedFrameTransport {
   public TFixedFrameTransport(TTransport otrans, TRecordStreamHeader header, size_t blockSize);
   // writes to otrans in blocks with header and then blockSize - sizeof header data
   public void write(buf, length) {
   int left = length; 
   while(left >0)  {
     size_t towrite = min(left, blockSize - cur);
      if(towrite > 0) 
        otrans.write(buf, towrite);
     cur += towrite;
     left -= towrite;
     if(left >0) {
        cur += writeHeader(); 
     }
   }
}
  
   public buf read(size);
   ..
}
{code}

{code title=TChecksumTransport.h|boderStyle=solid}

class TChecksumTransport {
   // here the block size is for sub-blocks in the context of TRecordStream/TFixedFrameTransport
   public TChecksumTransport(TTransport otrans, size_t blockSize);

}
{code}

{code title=TZlibTransport.h|borderStyle=solid}

class TZlibTransport {
  //  dreiss has already implemented this and we just need a flush method to force it to end a zlib stream
}

{code}

{code title=TRecordStreamTransport.h|borderStyle=solid}
 class TRecordStreamTransport {
   public TRecordStreamTransport(string subBlockTransport, TTransport otrans, size_t blockSize, size_t subBlockSize, ...) {
      TTransport fixedFrameTrans = new TFixedFrameTransport(otrans, header, blockSize);
      if(subBlockTransport == "tzlib") { 
        subBlockTrans = new TZlibTransport(fixedFrameTrans, subBlockSize);
      } else if(subBlockTransport == "tchecksum") {
        // and so on
      }
  }
  public void write(buf, length) {
   subBlockTrans->write(buf, length);
   }
  // etc
}
{code}

And that's about it. So we need TFixedFrame, TChecksum, TZlib and TRecordStream, all 4 of which are relatively straight forward and contained (although TZlib isn't so easy to implement).

h4. Problems

A problem this approach has is there will almost always certainly be spill over from one block to the next. You may not want this for one of 2 reasons. (1) you have multiple writers and are appending and thus may need to write incomplete blocks with padding. This could mean significant waste since the last block could very well be underused. (2) you want to be able to later process each block independently for example by Hadoop.

To fix this, TRecordStream's write method would have to be cognizant of the remaining bytes left in each block so as to try and make the inner blocks make as much use of the block as possible.  This can be an add on. We would probably need this in only C++ and Java.

comments?


> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated THRIFT-111:
--------------------------------

    Description: 
Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)

TRecordStream is a Thrift transport that encodes data in a format
suitable for storage in a file (not synchronous communication).

TRecordStream achieves following design goals:
- Be self-describing and extensible.  A file containing a TRecordStream
  must contain enough metadata for an application to read it with no other
  context.  It should be possible to add new features without breaking
  backwards and forwards compatibility.  It should be possible to completely
  change the format without confusing old or programs.
- Be robust against disk corruption.  All data and metadata must (optionally)
  be checksummed.  It must be possible to recover and continue reading
  uncorrupted data after corruption is encountered.
- Be (optionally) human-readable.  TRecordStream will also be used for
  plan-text, line-oriented, human-readable data.  Allowing a plain-text,
  line-oriented, human-readable header format will be advantageous for this
  use case.
- Support asynchronous file I/O.  This feature will not be implemented in the
  first version of TRecordStream, but the implementation must support
  the eventual inclusion of this feature.
- Be performant.  No significant sacrifice of speed should be made in order to
  achieve any of the other design goals.
- Support small synchronous writes



TRecordStream will not do any I/O itself, but will instead focus on
preparing the data format and depend on an underlying transport (TFDTransport,
for example) to write the data to a file.

TRecordStream will have two distinct formats: binary and plain text.

Binary-format streams shall begin with a format version number, encoded as a
32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
first byte of a TRecordStream will always be 0. The version number
shall be repeated once to guard against corruption.  If the two copies of the
version number do not match, the stream must be considered corrupt, and
recovery should proceed as described below (TODO).

Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
after the colon), followed by the decimal form of the version number
(ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
version line shall be repeated.

This document describes version 1 of the format.  Version 1 streams are
composed of series of chunks.  Variable-length chunks are supported, but their
use is discoraged because they make recovering from corrupt chunk headers
difficult.  Each chunk begins with the redundant version identifiers described
above.

Following the version numbers, a binary-format stream shall contain the
following fields, in order and with no padding:
- The (32-bit) CRC-32 of the header length + header data.
- The 32-bit big endian header length.
- A variable-length header, which is a TBinaryProtocol-serialized Thrift
  structure (whose exact structure is defined in
  robust_offline_stream.thrift).

A plain-text stream should follow the versions with:
- The string "Header-Checksum: "
- The eight-character (leading-zero-padded) hexadecimal encoding of the
  unsigned CRC-32 of the header (which does *not* include the CRC-32).
- A linefeed (0x0a).
- A header consisting of zero or more entries, where each entry consists of
  - An entry name, which is an ASCII string consisting of alphanumeric
    characters, dashes ("-"), underscores, and periods (full-stops).
  - A colon followed by a space.
  - An entry value, which is a printable ASCII string not including any
    linefeeds.
  - A linefeed.
- A linefeed.

Header entry names may be repeated.  The handling of repeated names is
dependent on the particular name.  Unless otherwise specified, all entries
with a given name other than the last are ignored.

The actual data will be stored in sub-chunks, which may optionally be
compressed.  (The chunk header will define the compression format used.)  The
chunk header will specify the following fields for each sub-chunk:
 - (optional) Offset within the chunk.  If ommitted, it should be assumed to
   immediately follow the previous sub-chunk.
 - (required) Length of the (optionally) compressed sub-chunk.  This is the
   physical number of bytes in the stream taken up by the sub-cunk.
 - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
   hint.
 - (optional) CRC-32 of the (optionally compressed) sub-chunk.
 - (optional) CRC-32 of the uncompressed sub-chunk.

If no compression format is specified, the sub-chunks should be assumed to be
in "raw" format.

{code:title=TRecordStream.thrift|borderStyle=solid}


namespace cpp    facebook.thrift.transport.record_stream
namespace java   com.facebook.thrift.transport.recrod_stream
namespace python thrift.transport.recrod_stream


/*
 * enums in plain-text headers should be represented as strings, not numbers.
 * Each enum value should specify the string used in plain text.
 */

enum CompressionType {
  /**
   * "raw": No compression.
   *
   * The data written to the TRecordStream object appears byte-for-byte
   * in the stream.  Raw format streams ignore the uncompressed length and
   * uncompressed checksum of the sub-chunks.  It is strongly advised to use
   * checksums when writing raw sub-chunks.
   */
  COMPRESSION_RAW = 0,

  /**
   * "zlib": zlib compression.
   *
   * The compressed data is a zlib stream compressed with the "deflate"
   * algorithm.  This format is specified by RFCs 1950 and 1951, and is
   * produced by zlib's "compress" or "deflate" functions.  Note that this is
   * *not* a raw "deflate" stream nor a gzip file.
   */
  COMPRESSION_ZLIB = 1,
}

enum RecordType {
  /**
   * (Absent in plain text.) Unspecified record type.
   */
  RECORD_UNKNOWN = 0,

  /**
   * "struct": Thrift structures, serialized back-to-back.
   */
  RECORD_STRUCT = 1,

  /**
   * "call": Thrift method calls, produced by send_method();
   */
  RECORD_CALL = 2,

  /**
   * "lines": Line-oriented text data.
   */
  RECORD_LINES = 3,
}

enum ProtocolType {
  /** (Absent in plain text.) */
  PROTOCOL_UNKNOWN     = 0;
  /** "binary" */
  PROTOCOL_BINARY      = 1;
  /** "dense" */
  PROTOCOL_DENSE       = 2;
  /** "json" */
  PROTOCOL_JSON        = 3;
  /** "simple_json" */
  PROTOCOL_SIMPLE_JSON = 4;
  /** "csv" */
  PROTOCOL_CSV         = 5;
}

/**
 * The structure used to represent metadata about a sub-chunk.
 * In plain text, this structure is included as the value of a "Sub-Chunk"
 * header entry.  Each of these fields should be included, represented
 * according to the comment for ChunkHeader.  Fields should be in order and
 * separated by a single space.  Absent fields should be included as a single
 * dash ("-").
 */
struct SubChunkHeader {
  1: optional i32 offset;
  2: required i32 length;
  3: optional i32 checksum;
  4: optional i32 uncompressed_length;
  5: optional i32 uncompressed_checksum;
}

/**
 * This is the top-level structure encoded as the chunk header.
 * Unless otherwise specified, field will be represented in plain text by
 * uppercasing each word in the field name and replacing underscores with
 * hyphens, producing the field name.  Integers should be ASCII-encoded
 * decimal, except for checksums which should be ASCII-encoded hexadecimal
 * unsigned.
 */
struct ChunkHeader {
  /**
   * Number of bytes per chunk.
   * Recommended to be a power of 2.
   */
  1: required i32 chunk_size;

  /**
   * Type of compression used for sub-chunks.
   * Assumed to be RAW if absent.
   */
  3: optional CompressionType compression_type = COMPRESSION_RAW;

  /**
   * Type of records encoded in the sub-chunks.
   * This information is made accessible to applications,
   * but is otherwise uninterpreted by the transport.
   */
  4: optional RecordType record_type = RECORD_UNKNOWN;

  /**
   * Protocol used for serializing records.
   * This information is made accessible to applications,
   * but is otherwise uninterpreted by the transport.
   */
  5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;

  /**
   * The metadata for the individual sub-chunks,
   * in the order they should be read.
   *
   * In the plain-text format, each of these is written as a separate
   * "Sub-Chunk" header entry, in order.
   */
  2: required list<SubChunkHeader> sub_chunk_headers;
}
{code}


  was:

Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)

TRecordStream is a Thrift transport that encodes data in a format
suitable for storage in a file (not synchronous communication).

TRecordStream achieves following design goals:
- Be self-describing and extensible.  A file containing a TRecordStream
  must contain enough metadata for an application to read it with no other
  context.  It should be possible to add new features without breaking
  backwards and forwards compatibility.  It should be possible to completely
  change the format without confusing old or programs.
- Be robust against disk corruption.  All data and metadata must (optionally)
  be checksummed.  It must be possible to recover and continue reading
  uncorrupted data after corruption is encountered.
- Be (optionally) human-readable.  TRecordStream will also be used for
  plan-text, line-oriented, human-readable data.  Allowing a plain-text,
  line-oriented, human-readable header format will be advantageous for this
  use case.
- Support asynchronous file I/O.  This feature will not be implemented in the
  first version of TRecordStream, but the implementation must support
  the eventual inclusion of this feature.
- Be performant.  No significant sacrifice of speed should be made in order to
  achieve any of the other design goals.
- Support small synchronous writes



TRecordStream will not do any I/O itself, but will instead focus on
preparing the data format and depend on an underlying transport (TFDTransport,
for example) to write the data to a file.

TRecordStream will have two distinct formats: binary and plain text.

Binary-format streams shall begin with a format version number, encoded as a
32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
first byte of a TRecordStream will always be 0. The version number
shall be repeated once to guard against corruption.  If the two copies of the
version number do not match, the stream must be considered corrupt, and
recovery should proceed as described below (TODO).

Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
after the colon), followed by the decimal form of the version number
(ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
version line shall be repeated.

This document describes version 1 of the format.  Version 1 streams are
composed of series of chunks.  Variable-length chunks are supported, but their
use is discoraged because they make recovering from corrupt chunk headers
difficult.  Each chunk begins with the redundant version identifiers described
above.

Following the version numbers, a binary-format stream shall contain the
following fields, in order and with no padding:
- The (32-bit) CRC-32 of the header length + header data.
- The 32-bit big endian header length.
- A variable-length header, which is a TBinaryProtocol-serialized Thrift
  structure (whose exact structure is defined in
  robust_offline_stream.thrift).

A plain-text stream should follow the versions with:
- The string "Header-Checksum: "
- The eight-character (leading-zero-padded) hexadecimal encoding of the
  unsigned CRC-32 of the header (which does *not* include the CRC-32).
- A linefeed (0x0a).
- A header consisting of zero or more entries, where each entry consists of
  - An entry name, which is an ASCII string consisting of alphanumeric
    characters, dashes ("-"), underscores, and periods (full-stops).
  - A colon followed by a space.
  - An entry value, which is a printable ASCII string not including any
    linefeeds.
  - A linefeed.
- A linefeed.

Header entry names may be repeated.  The handling of repeated names is
dependent on the particular name.  Unless otherwise specified, all entries
with a given name other than the last are ignored.

The actual data will be stored in sub-chunks, which may optionally be
compressed.  (The chunk header will define the compression format used.)  The
chunk header will specify the following fields for each sub-chunk:
 - (optional) Offset within the chunk.  If ommitted, it should be assumed to
   immediately follow the previous sub-chunk.
 - (required) Length of the (optionally) compressed sub-chunk.  This is the
   physical number of bytes in the stream taken up by the sub-cunk.
 - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
   hint.
 - (optional) CRC-32 of the (optionally compressed) sub-chunk.
 - (optional) CRC-32 of the uncompressed sub-chunk.

If no compression format is specified, the sub-chunks should be assumed to be
in "raw" format.

*/


namespace cpp    facebook.thrift.transport.record_stream
namespace java   com.facebook.thrift.transport.recrod_stream
namespace python thrift.transport.recrod_stream


/*
 * enums in plain-text headers should be represented as strings, not numbers.
 * Each enum value should specify the string used in plain text.
 */

enum CompressionType {
  /**
   * "raw": No compression.
   *
   * The data written to the TRecordStream object appears byte-for-byte
   * in the stream.  Raw format streams ignore the uncompressed length and
   * uncompressed checksum of the sub-chunks.  It is strongly advised to use
   * checksums when writing raw sub-chunks.
   */
  COMPRESSION_RAW = 0,

  /**
   * "zlib": zlib compression.
   *
   * The compressed data is a zlib stream compressed with the "deflate"
   * algorithm.  This format is specified by RFCs 1950 and 1951, and is
   * produced by zlib's "compress" or "deflate" functions.  Note that this is
   * *not* a raw "deflate" stream nor a gzip file.
   */
  COMPRESSION_ZLIB = 1,
}

enum RecordType {
  /**
   * (Absent in plain text.) Unspecified record type.
   */
  RECORD_UNKNOWN = 0,

  /**
   * "struct": Thrift structures, serialized back-to-back.
   */
  RECORD_STRUCT = 1,

  /**
   * "call": Thrift method calls, produced by send_method();
   */
  RECORD_CALL = 2,

  /**
   * "lines": Line-oriented text data.
   */
  RECORD_LINES = 3,
}

enum ProtocolType {
  /** (Absent in plain text.) */
  PROTOCOL_UNKNOWN     = 0;
  /** "binary" */
  PROTOCOL_BINARY      = 1;
  /** "dense" */
  PROTOCOL_DENSE       = 2;
  /** "json" */
  PROTOCOL_JSON        = 3;
  /** "simple_json" */
  PROTOCOL_SIMPLE_JSON = 4;
  /** "csv" */
  PROTOCOL_CSV         = 5;
}

/**
 * The structure used to represent metadata about a sub-chunk.
 * In plain text, this structure is included as the value of a "Sub-Chunk"
 * header entry.  Each of these fields should be included, represented
 * according to the comment for ChunkHeader.  Fields should be in order and
 * separated by a single space.  Absent fields should be included as a single
 * dash ("-").
 */
struct SubChunkHeader {
  1: optional i32 offset;
  2: required i32 length;
  3: optional i32 checksum;
  4: optional i32 uncompressed_length;
  5: optional i32 uncompressed_checksum;
}

/**
 * This is the top-level structure encoded as the chunk header.
 * Unless otherwise specified, field will be represented in plain text by
 * uppercasing each word in the field name and replacing underscores with
 * hyphens, producing the field name.  Integers should be ASCII-encoded
 * decimal, except for checksums which should be ASCII-encoded hexadecimal
 * unsigned.
 */
struct ChunkHeader {
  /**
   * Number of bytes per chunk.
   * Recommended to be a power of 2.
   */
  1: required i32 chunk_size;

  /**
   * Type of compression used for sub-chunks.
   * Assumed to be RAW if absent.
   */
  3: optional CompressionType compression_type = COMPRESSION_RAW;

  /**
   * Type of records encoded in the sub-chunks.
   * This information is made accessible to applications,
   * but is otherwise uninterpreted by the transport.
   */
  4: optional RecordType record_type = RECORD_UNKNOWN;

  /**
   * Protocol used for serializing records.
   * This information is made accessible to applications,
   * but is otherwise uninterpreted by the transport.
   */
  5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;

  /**
   * The metadata for the individual sub-chunks,
   * in the order they should be read.
   *
   * In the plain-text format, each of these is written as a separate
   * "Sub-Chunk" header entry, in order.
   */
  2: required list<SubChunkHeader> sub_chunk_headers;
}



> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624528#action_12624528 ] 

Pete Wyckoff commented on THRIFT-111:
-------------------------------------

bq. sub-block headers are all grouped together so they can be guarded with a single checksum. Here they would have to be checksummed individually, and that would have to be handled by (for example) the zlib layer.

If tzlib or tfoo needs to know the size of the sub-block, I guess it would need to use tframetransport to read/write frames,  in which case, we would be in trouble since it doesn't do checksums on its header (ie the length of the frame).  might be more convenient to offer the option of read/writeFrame in TFixedFrameTransport itself in which case, it would write the header info and could do it with a checksum.

Is there an advantage to writing all the sub-block headers together? in which case, we could force the use of read/writeFrame, but the thing I don't like about that is we then force knowledge of using TFixedFrameTransport on the inner transport and we lose the loose coupling and advantage of easily integrating new transports into TRS. 

re: the start of the first complete sub-block, yes, I think that should be in the block header.   




> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624452#action_12624452 ] 

wyckoff edited comment on THRIFT-111 at 8/21/08 12:10 PM:
---------------------------------------------------------------

RE: sub chunk headers
If we ax the requirement of keeping the sub-block headers with the block header (note I'm changing terminology), then buffering is not needed and things become a lot simpler. The main reason for this requirement was to keep data loss minimal in the face of a CRC check error for the overall block. But, if we keep blocks relatively small - e.g., 16 MBs, since data errors are pretty infrequent, I don't think this will be an issue.

The advantage of this approach is we can decouple the sub-block writer from the block writer and the following clean, simple design is possible.



 
{code:title=TFixedFrameTransport.h|borderStyle=solid}
class TFixedFrameHeader {
  public TFixedFrameHeader(string subBlockTransport, size_t blockSize, ....);
  public size_t writeHeader(size_t offsetToFirstNewSubBlock, TTransport otrans) ; 
}

class TFixedFrameTransport {
   public TFixedFrameTransport(TTransport otrans, TRecordStreamHeader header, size_t blockSize);
   // writes to otrans in blocks with header and then blockSize - sizeof header data
   public void write(buf, length) {
   int left = length; 
   while(left >0)  {
     size_t towrite = min(left, blockSize - cur);
      if(towrite > 0) 
        otrans.write(buf, towrite);
     cur += towrite;
     left -= towrite;
     if(left >0) {
        cur += writeHeader(); 
     }
   }
}
  
   public buf read(size);
   ..
}

{code}

{code:title=TChecksumTransport.h|boderStyle=solid}

class TChecksumTransport {
   // here the block size is for sub-blocks in the context of TRecordStream/TFixedFrameTransport
   public TChecksumTransport(TTransport otrans, size_t blockSize);
}

{code}

{code:title=TZlibTransport.h|borderStyle=solid}

class TZlibTransport {
  //  dreiss has already implemented this and we just need a flush method to force it to end a zlib stream
}

{code}

{code:title=TRecordStreamTransport.h|borderStyle=solid}
 class TRecordStreamTransport {
   public TRecordStreamTransport(string subBlockTransport, TTransport otrans, size_t blockSize, size_t subBlockSize, ...) {
      TTransport fixedFrameTrans = new TFixedFrameTransport(otrans, header, blockSize);
      if(subBlockTransport == "tzlib") { 
        subBlockTrans = new TZlibTransport(fixedFrameTrans, subBlockSize);
      } else if(subBlockTransport == "tchecksum") {
        // and so on
      }
  }
  public void write(buf, length) {
   subBlockTrans->write(buf, length);
   }
  // etc
}

{code}

And that's about it. So we need TFixedFrame, TChecksum, TZlib and TRecordStream, all 4 of which are relatively straight forward and contained (although TZlib isn't so easy to implement).

h4. Problems

A problem this approach has is there will almost always certainly be spill over from one block to the next. You may not want this for one of 2 reasons. (1) you have multiple writers and are appending and thus may need to write incomplete blocks with padding. This could mean significant waste since the last block could very well be underused. (2) you want to be able to later process each block independently for example by Hadoop.

To fix this, TRecordStream's write method would have to be cognizant of the remaining bytes left in each block so as to try and make the inner blocks make as much use of the block as possible.  This can be an add on. We would probably need this in only C++ and Java.

comments?


      was (Author: wyckoff):
    RE: sub chunk headers
If we ax the requirement of keeping the sub-block headers with the block header (note I'm changing terminology), then buffering is not needed and things become a lot simpler. The main reason for this requirement was to keep data loss minimal in the face of a CRC check error for the overall block. But, if we keep blocks relatively small - e.g., 16 MBs, since data errors are pretty infrequent, I don't think this will be an issue.

The advantage of this approach is we can decouple the sub-block writer from the block writer and the following clean, simple design is possible.



 
{code:title=TFixedFrameTransport.h|borderStyle=solid}
class TFixedFrameHeader {
  public TFixedFrameHeader(string subBlockTransport, size_t blockSize, ....);
  public size_t writeHeader(size_t offsetToFirstNewSubBlock, TTransport otrans) ; 
}

class TFixedFrameTransport {
   public TFixedFrameTransport(TTransport otrans, TRecordStreamHeader header, size_t blockSize);
   // writes to otrans in blocks with header and then blockSize - sizeof header data
   public void write(buf, length) {
   int left = length; 
   while(left >0)  {
     size_t towrite = min(left, blockSize - cur);
      if(towrite > 0) 
        otrans.write(buf, towrite);
     cur += towrite;
     left -= towrite;
     if(left >0) {
        cur += writeHeader(); 
     }
   }
}
  
   public buf read(size);
   ..
}

{code}

{code title=TChecksumTransport.h|boderStyle=solid}

class TChecksumTransport {
   // here the block size is for sub-blocks in the context of TRecordStream/TFixedFrameTransport
   public TChecksumTransport(TTransport otrans, size_t blockSize);
}

{code}

{code title=TZlibTransport.h|borderStyle=solid}

class TZlibTransport {
  //  dreiss has already implemented this and we just need a flush method to force it to end a zlib stream
}

{code}

{code title=TRecordStreamTransport.h|borderStyle=solid}
 class TRecordStreamTransport {
   public TRecordStreamTransport(string subBlockTransport, TTransport otrans, size_t blockSize, size_t subBlockSize, ...) {
      TTransport fixedFrameTrans = new TFixedFrameTransport(otrans, header, blockSize);
      if(subBlockTransport == "tzlib") { 
        subBlockTrans = new TZlibTransport(fixedFrameTrans, subBlockSize);
      } else if(subBlockTransport == "tchecksum") {
        // and so on
      }
  }
  public void write(buf, length) {
   subBlockTrans->write(buf, length);
   }
  // etc
}

{code}

And that's about it. So we need TFixedFrame, TChecksum, TZlib and TRecordStream, all 4 of which are relatively straight forward and contained (although TZlib isn't so easy to implement).

h4. Problems

A problem this approach has is there will almost always certainly be spill over from one block to the next. You may not want this for one of 2 reasons. (1) you have multiple writers and are appending and thus may need to write incomplete blocks with padding. This could mean significant waste since the last block could very well be underused. (2) you want to be able to later process each block independently for example by Hadoop.

To fix this, TRecordStream's write method would have to be cognizant of the remaining bytes left in each block so as to try and make the inner blocks make as much use of the block as possible.  This can be an add on. We would probably need this in only C++ and Java.

comments?

  
> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624963#action_12624963 ] 

Pete Wyckoff commented on THRIFT-111:
-------------------------------------

bq. don't think that having the sub-block lengths managed by the outer TRS will force any tighter coupling.

in the sense that it imposes the sub-block notion on the inner protocol. e.g., if i just wanted to use something like TFrameTransportWithChecksums on each record as the inner protocol, my sub-blocks would be one record in length.   Unless we did something like buffer writes from the inner protocol and then have TRS.write decide on sub-block boundaries. But, then in some cases, they could become somewhat arbitrary.  IT does seem a little weird checksumming just one i32, I agree, but that i32 is really the sub-block header and could later be more complex and as such should be some kind of ddl-ish type thing.  {i16 version, i32 length }.

I am not fundamentally opposed, I just think it forces buffering on something that needn't have it.

I think the fundamental issue is that some inner protocols may need help framing things and some don't. e.g., TZlibTransport doesn't have the notion of a unit so would need TRS to help it in that regard so it doesn't end of reading past the end up its unit.  (without a bunch of peaking).  But things like TFramedChecksum would have it built in.  So in reality, TZlib would be doing buffering since we'd need to wrap it in TFramed or something, in which case my argument about buffering doesn't really hold.


> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "David Reiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624556#action_12624556 ] 

David Reiss commented on THRIFT-111:
------------------------------------

Putting all of the sub-block sizes together means that we can checksum them all at once.  Checksumming a 4-byte length is just weird.  I don't think that having the sub-block lengths managed by the outer TRS will force any tighter coupling.

> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649192#action_12649192 ] 

Pete Wyckoff commented on THRIFT-111:
-------------------------------------

re: headers.

It might be nice to be able to write the headers as typical thrift structures - other than the data that tells you the protocol.  Other formats like this may write the headers in reverse at the end of the frame.  The advantage here is ease of reading and versioning the headers - esp to implement the TRS in languages like Perl and Python would be much simpler.  And the metadata could be much richer.

2 downsides:

  1. Still need some up front info to know the frame size
   2. Determining the size of the header becomes much more difficult since we want to write FrameSize - header bytes.  With binary protocol it is easier but once we get into JSON and such ...

For  #1, a transport/abstraction that just writes FixedFrames with a header at the front really addresses this.  If TRS  implements subframes and metadata on top of that abstraction it may make sense.
For #2,  we always have some form of this problem since the space to write the subframe length (or offsets) is variable with JSON or plain text. 

Problem #3: For plain text header format, there is no protocol so you cannot serialize/deserialize the header as a thrift struct - the hive project in hadoop has something like a text protocol but it doesn't write out field ids.  We could augment our definition of the plain text header to include field ids though. 



> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624452#action_12624452 ] 

wyckoff edited comment on THRIFT-111 at 8/21/08 12:09 PM:
---------------------------------------------------------------

RE: sub chunk headers
If we ax the requirement of keeping the sub-block headers with the block header (note I'm changing terminology), then buffering is not needed and things become a lot simpler. The main reason for this requirement was to keep data loss minimal in the face of a CRC check error for the overall block. But, if we keep blocks relatively small - e.g., 16 MBs, since data errors are pretty infrequent, I don't think this will be an issue.

The advantage of this approach is we can decouple the sub-block writer from the block writer and the following clean, simple design is possible.



 
{code:title=TFixedFrameTransport.h|borderStyle=solid}
class TFixedFrameHeader {
  public TFixedFrameHeader(string subBlockTransport, size_t blockSize, ....);
  public size_t writeHeader(size_t offsetToFirstNewSubBlock, TTransport otrans) ; 
}

class TFixedFrameTransport {
   public TFixedFrameTransport(TTransport otrans, TRecordStreamHeader header, size_t blockSize);
   // writes to otrans in blocks with header and then blockSize - sizeof header data
   public void write(buf, length) {
   int left = length; 
   while(left >0)  {
     size_t towrite = min(left, blockSize - cur);
      if(towrite > 0) 
        otrans.write(buf, towrite);
     cur += towrite;
     left -= towrite;
     if(left >0) {
        cur += writeHeader(); 
     }
   }
}
  
   public buf read(size);
   ..
}

{code}

{code title=TChecksumTransport.h|boderStyle=solid}

class TChecksumTransport {
   // here the block size is for sub-blocks in the context of TRecordStream/TFixedFrameTransport
   public TChecksumTransport(TTransport otrans, size_t blockSize);
}

{code}

{code title=TZlibTransport.h|borderStyle=solid}

class TZlibTransport {
  //  dreiss has already implemented this and we just need a flush method to force it to end a zlib stream
}

{code}

{code title=TRecordStreamTransport.h|borderStyle=solid}
 class TRecordStreamTransport {
   public TRecordStreamTransport(string subBlockTransport, TTransport otrans, size_t blockSize, size_t subBlockSize, ...) {
      TTransport fixedFrameTrans = new TFixedFrameTransport(otrans, header, blockSize);
      if(subBlockTransport == "tzlib") { 
        subBlockTrans = new TZlibTransport(fixedFrameTrans, subBlockSize);
      } else if(subBlockTransport == "tchecksum") {
        // and so on
      }
  }
  public void write(buf, length) {
   subBlockTrans->write(buf, length);
   }
  // etc
}

{code}

And that's about it. So we need TFixedFrame, TChecksum, TZlib and TRecordStream, all 4 of which are relatively straight forward and contained (although TZlib isn't so easy to implement).

h4. Problems

A problem this approach has is there will almost always certainly be spill over from one block to the next. You may not want this for one of 2 reasons. (1) you have multiple writers and are appending and thus may need to write incomplete blocks with padding. This could mean significant waste since the last block could very well be underused. (2) you want to be able to later process each block independently for example by Hadoop.

To fix this, TRecordStream's write method would have to be cognizant of the remaining bytes left in each block so as to try and make the inner blocks make as much use of the block as possible.  This can be an add on. We would probably need this in only C++ and Java.

comments?


      was (Author: wyckoff):
    
RE: sub chunk headers
If we ax the requirement of keeping the sub-block headers with the block header (note I'm changing terminology), then buffering is not needed and things become a lot simpler. The main reason for this requirement was to keep data loss minimal in the face of a CRC check error for the overall block. But, if we keep blocks relatively small - e.g., 16 MBs, since data errors are pretty infrequent, I don't think this will be an issue.

The advantage of this approach is we can decouple the sub-block writer from the block writer and the following clean, simple design is possible.



 
{code:title=TFixedFrameTransport.h|borderStyle=solid}
class TFixedFrameHeader {
  public TFixedFrameHeader(string subBlockTransport, size_t blockSize, ....);
  public size_t writeHeader(size_t offsetToFirstNewSubBlock, TTransport otrans) ; 
}

class TFixedFrameTransport {
   public TFixedFrameTransport(TTransport otrans, TRecordStreamHeader header, size_t blockSize);
   // writes to otrans in blocks with header and then blockSize - sizeof header data
   public void write(buf, length) {
   int left = length; 
   while(left >0)  {
     size_t towrite = min(left, blockSize - cur);
      if(towrite > 0) 
        otrans.write(buf, towrite);
     cur += towrite;
     left -= towrite;
     if(left >0) {
        cur += writeHeader(); 
     }
   }
}
  
   public buf read(size);
   ..
}
{code}

{code title=TChecksumTransport.h|boderStyle=solid}

class TChecksumTransport {
   // here the block size is for sub-blocks in the context of TRecordStream/TFixedFrameTransport
   public TChecksumTransport(TTransport otrans, size_t blockSize);

}
{code}

{code title=TZlibTransport.h|borderStyle=solid}

class TZlibTransport {
  //  dreiss has already implemented this and we just need a flush method to force it to end a zlib stream
}

{code}

{code title=TRecordStreamTransport.h|borderStyle=solid}
 class TRecordStreamTransport {
   public TRecordStreamTransport(string subBlockTransport, TTransport otrans, size_t blockSize, size_t subBlockSize, ...) {
      TTransport fixedFrameTrans = new TFixedFrameTransport(otrans, header, blockSize);
      if(subBlockTransport == "tzlib") { 
        subBlockTrans = new TZlibTransport(fixedFrameTrans, subBlockSize);
      } else if(subBlockTransport == "tchecksum") {
        // and so on
      }
  }
  public void write(buf, length) {
   subBlockTrans->write(buf, length);
   }
  // etc
}
{code}

And that's about it. So we need TFixedFrame, TChecksum, TZlib and TRecordStream, all 4 of which are relatively straight forward and contained (although TZlib isn't so easy to implement).

h4. Problems

A problem this approach has is there will almost always certainly be spill over from one block to the next. You may not want this for one of 2 reasons. (1) you have multiple writers and are appending and thus may need to write incomplete blocks with padding. This could mean significant waste since the last block could very well be underused. (2) you want to be able to later process each block independently for example by Hadoop.

To fix this, TRecordStream's write method would have to be cognizant of the remaining bytes left in each block so as to try and make the inner blocks make as much use of the block as possible.  This can be an add on. We would probably need this in only C++ and Java.

comments?

  
> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623860#action_12623860 ] 

Pete Wyckoff commented on THRIFT-111:
-------------------------------------

Thinking about implementation, if we created a TFixedSizedChunkTransport which handles writing the headers but other than that just writes out whatever it gets, we could implement TRecordStream with no buffering by just using a TZlibTransport or TCRCTransport on top of TFixedSizedChunkTransport. So, an instantiation would look something like:

TTransport fixedSizedChunkTransport = new TFixedSizedChunkTransport(new TFDTransport(...), chunkSize, headerInfo);
TTransport recordStreamZlib = new TZlibTransport(fixedSizedChunksTransport, bufferSize, ...);
or
TTransport recordStreamChecksum = new TChecksumTransport(fixedSizedChunksTransport, bufferSize, ...);

We may end up needing to actually write a TRecordStream class, but it would be very thin basically relying on subchunks being written by TChecksum or TZlib or something else.

This provides a much simpler implementation and would mean TZlib, TChecksum and TFixedSizedChunks could be re-used in other ways.

-- pete








> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> */
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Posted by "David Reiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624454#action_12624454 ] 

David Reiss commented on THRIFT-111:
------------------------------------

One nice property of the original design is that the sub-block headers are all grouped together so they can be guarded with a single checksum.  Here they would have to be checksummed individually, and that would have to be handled by (for example) the zlib layer.  Also, if we have spillover, there has to be some way to find the start of the first complete sub-block just given the block header.

> TRecordStream: a robust transport for writing records  with (optional) CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc circulated on the public thrift lists under the name TRobustOfflineStream in May 08 with the addition of the requirement of handling small synchronous writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plan-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
> Plain-text streams shall begin with the string ASCII "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
> This document describes version 1 of the format.  Version 1 streams are
> composed of series of chunks.  Variable-length chunks are supported, but their
> use is discoraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If ommitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-cunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.recrod_stream
> namespace python thrift.transport.recrod_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, field will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the field name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.