Posted to dev@thrift.apache.org by "Larry Hastings (JIRA)" <ji...@apache.org> on 2009/02/05 07:15:59 UTC

[jira] Commented: (THRIFT-110) A more compact format

    [ https://issues.apache.org/jira/browse/THRIFT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670639#action_12670639 ] 

Larry Hastings commented on THRIFT-110:
---------------------------------------

I'm not a Thrift guy, but I just talked with David Reiss about this thread.  I have a couple more crazy suggestions--but he tells me they'd require compiler changes.

First: allow casting from bool to int, so that you can send the integer values 0 and 1 as boolean-false and boolean-true respectively.


Second: make up your mind--are you using zigzag ints or not?  If you are, you only need *one* integer type.  I think it'd be better to go the other way: have int types for int-8, int-16, int-24, int-32, int-40, int-64.  That would give back one additional bit per byte of the ints, as you'd no longer need the per-byte continuation bit that the zigzag varint encoding spends.
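
For readers who have not met the encoding, here is a minimal sketch of a zigzag varint as it is commonly defined (the function names are made up for illustration and are not taken from any Thrift patch). It shows where the one bit per byte goes: each output byte carries 7 payload bits plus a continuation flag.

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def varint(u: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, least significant group first."""
    out = bytearray()
    while True:
        low7 = u & 0x7F
        u >>= 7
        if u:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def zigzag_varint(n: int) -> bytes:
    """Hypothetical helper: zigzag-map a signed integer, then varint-encode it."""
    return varint(zigzag(n))


if __name__ == "__main__":
    for value in (0, -1, 63, 64, 300):
        print(value, zigzag_varint(value).hex())
```

Under this definition a single byte covers -64 through 63; a fixed int-8 would cover -128 through 127 in the same space, which is the bit being given back.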


Finally: this still leaves one type-header value for future expansion, which I suggest should be explicitly defined as "followed by a variable-length type-header value".  Even if you dropped zigzag ints for integer values, they might be worth keeping here.

I'm gonna mark up type-and-id below; for clarity I'm going to put square brackets around individual bytes.

If we use zigzag ints here, type-and-id becomes:

type-and-id
=> [ field-id-delta type-header ]
   | [ 0 type-header ] zigzag-varint
   | [ field-id-delta 0xF ] zigzag-varint
   | [ 0 0xF ] zigzag-varint zigzag-varint
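
A rough sketch of an encoder for the grammar above, under several assumptions that the grammar does not spell out: field-id-delta occupies the high four bits and type-header the low four, inline deltas are 1-15, inline type-headers are 0x0-0xE, and when both are extended the delta's varint comes first. All names are hypothetical.

```python
def zigzag_varint(n: int) -> bytes:
    """Zigzag-map a signed integer, then emit 7 bits per byte with a continuation flag."""
    u = (n << 1) if n >= 0 else ((-n) << 1) - 1
    out = bytearray()
    while u >= 0x80:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)
    return bytes(out)


def encode_type_and_id(field_id_delta: int, type_header: int) -> bytes:
    """Encode type-and-id using the four alternatives of the zigzag grammar."""
    delta_inline = 1 <= field_id_delta <= 15   # assumption: 0 is the escape value
    type_inline = 0 <= type_header <= 0xE      # assumption: 0xF is the escape value
    high = field_id_delta if delta_inline else 0
    low = type_header if type_inline else 0xF
    out = bytearray([(high << 4) | low])
    if not delta_inline:
        out += zigzag_varint(field_id_delta)   # assumption: delta precedes type
    if not type_inline:
        out += zigzag_varint(type_header)
    return bytes(out)


if __name__ == "__main__":
    for delta, type_header in ((1, 3), (16, 3), (2, 20), (-1, 20)):
        print(delta, type_header, encode_type_and_id(delta, type_header).hex())
```

The common case stays at one byte, and each escape costs only as many extra bytes as the escaped value needs.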

If we remove zigzags entirely, then for extended field deltas or types we must follow the type-and-id with a byte containing two type-id-headers: the high one for the extended field-id-delta, and the low one for type-header.  If either isn't used, set those four bits to 0.

type-and-id
=> [ field-id-delta type-header ]
   | [ 0 type-header ] [ field-id-delta-int-type-header 0 ] n-bit-int
   | [ field-id-delta 0xF ] [ 0 type-header-int-type-header ] n-bit-int
   | [ 0 0xF ] [ field-id-delta-int-type-header type-header-int-type-header ] n-bit-int n-bit-int

In terms of compression, I suspect leaving the zigzag ints here is a win; given that real-world use would likely never see extended type-headers, the only variant we'd see would be a field-id-delta that didn't fit in the range 1-15.  In that case, zigzag ints would always be either smaller than or the same size as the second approach.
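
The size claim can be checked with a little arithmetic. The sketch below counts the bytes that follow the first type-and-id byte in each scheme, assuming the fixed-width ints are signed two's complement, the extra int-type-header byte costs one byte, and field IDs fit in 16 bits. The function names are invented for this illustration.

```python
FIXED_WIDTHS_BITS = (8, 16, 24, 32, 40, 64)  # the int types suggested above


def zigzag_varint_len(n: int) -> int:
    """Bytes needed for the zigzag varint of n (7 payload bits per byte)."""
    u = (n << 1) if n >= 0 else ((-n) << 1) - 1
    length = 1
    while u >= 0x80:
        u >>= 7
        length += 1
    return length


def fixed_len(n: int) -> int:
    """One int-type-header byte plus the smallest signed fixed-width int holding n."""
    for bits in FIXED_WIDTHS_BITS:
        if -(1 << (bits - 1)) <= n < (1 << (bits - 1)):
            return 1 + bits // 8
    raise ValueError("value does not fit in 64 bits")


if __name__ == "__main__":
    for delta in (16, 63, 64, 127, 128, 8191, 8192, 32767):
        print(delta, zigzag_varint_len(delta), fixed_len(delta))
```

Under these assumptions the zigzag form is never larger for any positive delta up to 32767, and it is one byte smaller for deltas of 16-63 and 128-8191.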

I'm probably crazy,


/larry/

> A more compact format 
> ----------------------
>
>                 Key: THRIFT-110
>                 URL: https://issues.apache.org/jira/browse/THRIFT-110
>             Project: Thrift
>          Issue Type: Improvement
>            Reporter: Noble Paul
>            Assignee: Bryan Duxbury
>         Attachments: compact-proto-spec-2.txt, compact_proto_spec.txt, compact_proto_spec.txt, thrift-110-v2.patch, thrift-110-v3.patch, thrift-110-v4.patch, thrift-110-v5.patch, thrift-110-v6.patch, thrift-110-v7.patch, thrift-110-v8.patch, thrift-110-v9.patch, thrift-110.patch
>
>
> Thrift is not very compact in writing out data compared with, say, protobuf. It does not have the concept of variable-length integers or various other possible optimizations. In Solr we use a lot of such optimizations to make a very compact payload. Thrift has a lot in common with that format.
> It is all done in a single class:
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/common/util/NamedListCodec.java?revision=685640&view=markup
> The other optimizations include writing type/value in the same byte, very fast writes of Strings, externalizable strings, etc.
> We could use a Thrift format for non-Java clients, and I would like to see it as compact as the current Java version.
