Posted to dev@couchdb.apache.org by Randall Leeds <ra...@gmail.com> on 2011/10/04 21:02:02 UTC

Re: Universal Binary JSON in CouchDB

Hey,

Thanks for this thread.

I've been interested in ways to reduce the work from disk to client as well.
Unfortunately, the metadata inside the document objects is variable based on
query parameters (_attachments, _revisions, _revs_info...) so the server
needs to decode the disk binary anyway.
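
For example (illustrative _rev values), the same stored doc comes back
with a different body depending on the query string:

    GET /db/doc1                -> {"_id":"doc1","_rev":"2-def","name":"bob"}
    GET /db/doc1?revs_info=true -> the same fields plus
        "_revs_info":[{"rev":"2-def","status":"available"},
                      {"rev":"1-abc","status":"missing"}]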

I would say this is something we should carefully consider for a 2.0 api. I
know that, for simplicity, many people really like having the underscore
prefixed attributes mixed in right alongside the document data, but a future
API that separated these could really make things fly.

-Randall

On Wed, Sep 28, 2011 at 22:25, Benoit Chesneau <bc...@gmail.com> wrote:

> On Thursday, September 29, 2011, Riyad Kalla <rk...@gmail.com> wrote:
> > DISCLAIMER: This looks long, but reads quickly (I hope). If you are in a
> > rush,
> > just check the last 2 sections and see if it sounds interesting.
> >
> >
> > Hi everybody. I am new to the list, but a big fan of Couch and I have
> been
> > working
> > on something I wanted to share with the group.
> >
> > My apologies if this isn't the right venue or list etiquette... I wasn't
> > really
> > sure where to start with this conversation.
> >
> >
> > Background
> > =====================
> > With the help of the JSON spec community I've been finalizing a
> universal,
> > binary JSON format specification that offers 1:1 compatibility with JSON.
> >
> > The full spec is here (http://ubjson.org/) and the quick list of types
> is
> > here
> > (http://ubjson.org/type-reference/). Differences with existing specs and
> > "Why" are
> > all addressed on the site in the first few sections.
> >
> > The goal of the specification was first to maintain 1:1 compatibility
> with
> > JSON
> > (no custom data structures - like what caused BSON to be rejected in
> Issue
> > #702),
> > secondly to be as simple to work with as regular JSON (no complex data
> > structures or
> > encoding/decoding algorithms to implement) and lastly, it had to be
> smaller
> > than
> > compacted JSON and faster to generate and parse.
> >
> > Using a test doc that I see Filipe reference in a few of his issues
> > (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following
> > compression:
> >
> > * Compacted JSON: 3,861 bytes
> > * Univ. Binary JSON: 3,056 bytes (20% smaller)
> >
> > In some other sample data (e.g. jvm-serializers sample data) I see a 27%
> > compression
> > with a typical compression range of 20-30%.
> >
> > While these compression levels are average, the data is written out in an
> > unmolested
> > format that is optimized for read speed (no scanning for null
> terminators)
> > and criminally
> > simple to work with. (win-win)
> >
> > I added more clarifying information about compression characteristics in
> the
> > "Size Requirements"
> > section of the spec for anyone interested.
> >
> >
> > Motivation
> > ======================
> > I've been following the discussions surrounding a native binary JSON format
> for
> > the core
> > CouchDB file (Issue #1092) which transformed into keeping the format and
> > utilizing
> > Google's Snappy (Issue #1120) to provide what looks to be roughly a
> 40-50%
> > reduction in file
> > size at the cost of running the compression/decompression on every
> > read/write.
> >
> > I realize in light of the HTTP transport and JSON encoding/decoding cycle
> in
> > CouchDB, the
> > Snappy compression cycles are a very small part of the total time the
> server
> > spends working.
> >
> > I found this all interesting, but like I said, I realized up to this
> point
> > that Snappy
> > wasn't any form of bottleneck and the big compression wins server side
> were
> > great so I had
> > nothing to contribute to the conversation.
> >
> >
> > Catalyst
> > ======================
> > This past week I watched Tim Anglade's presentation (http://goo.gl/LLucD
> )
> > and started to
> > foam at the mouth when I saw his slides where he skipped the JSON
> > encode/decode cycle
> > server-side and just generated straight from binary on disk into
> MessagePack
> > and got
> > some phenomenal speedups from the server:
> > http://i.imgscalr.com/XKqXiLusT.png
> >
> > I pinged Tim to see what the chances of adding Univ Binary JSON support
> was
> > and he seemed
> > amenable to the idea as long as I could hand him an Erlang or Ruby impl
> > (unfortunately,
> > I am not familiar with either).
> >
> >
> > ah-HA! moment
> > ======================
> > Today it occurred to me that if CouchDB were able to (at the cost of 20%
> > more disk space
> > than it is using with Snappy enabled, but still 20% *less* than before
> > Snappy was integrated)
> > use the Universal Binary JSON format as its native storage format AND
> > support for serving replies
> > using the same format was added (a-la Tim's work), this would allow
> CouchDB
> > to (theoretically)
> > reply to queries by pulling bytes off disk (or memory) and immediately
> > streaming them back to
> > the caller with no intermediary step at all (no Snappy decompress, no
> Erlang
> > decode, no JSON encode).
> >
> > Given that the Univ Binary JSON spec is standard, easy to parse and
> simple
> > to convert back to
> > JSON, adding support for it seemed more consistent with Couch's motto of
> > ease and simplicity
> > than say MessagePack or Protobuff which provide better compression but at
> > the cost of more
> > complex formats and data types that have no analogue in JSON.
> >
> > I don't know the intricacies of Couch's internals; if that is wrong and
> some
> > Erlang
> > manipulation of the data would still be required, I believe it would
> still
> > be faster to pull the data
> > off disk in the Univ Binary JSON format, decode to Erlang native types
> and
> > then reply while
> > skipping the Snappy decompression step.
> >
> > If it *would* be possible to stream it back un-touched directly from
> disk,
> > that seems like
> > an enhancement that could potentially speed up CouchDB by at least an
> order
> > of magnitude.
> >
> >
> > Conclusion
> > =======================
> > I would appreciate any feedback on this idea from you guys with a lot
> more
> > knowledge of
> > the internals.
> >
> > I have no problem if this is a horrible idea and never going to happen, I
> > just wanted to try
> > and contribute something back.
> >
> >
> > Thank you all for reading.
> >
> > Best wishes,
> > Riyad
> >
>
> what is universal in something new?
>
> -  benoit
>

Re: Universal Binary JSON in CouchDB

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Oct 4, 2011 at 4:43 PM, Riyad Kalla <rk...@gmail.com> wrote:
> Tim's work is certainly the catalyst for my excitement about the potential
> here for Couch.
>
> As Paul pointed out, the correct discussion to have at this point is really
> about "do we support a binary format for responses" and if so "which
> one"? That discussion could go on for an eternity with everyone voting for
> their favorite (protobuff, smile, messagepack, etc.).
>
> The only reason I bring up the "disk store format" discussion into this
> conversation is to offer a hat-tip to a future where a binary response format
> selected now may dovetail nicely with alternative binary disk formats,
> enabling the stream-directly-from-disk scenario.
>
> If we were to hypothetically remove the possibility of the on-disk format
> ever changing, then I suppose the decision of binary response format just
> becomes an issue of "Which one is fast and easy to generate?".
>

Quite right, though I think you skipped a step, which is "Are we capable
of even generating alternate content-types?"

I think we are, with a minor amount of work. And I've tried to make
suggestions that would allow for a fairly trivial switch (which have
generally been accepted, minus that hugenum issue ;)

Things get more complicated if, say, (flame suit donned) we decided to
provide an XML response type. The issue here becomes less of a "can we
write code that produces XML" and more of a "how do we represent this
with XML" type bike-shedding issue.

Here ubjson is so close to the JSON format that merely producing it in
all responses is the technical challenge. Once someone shows that's not
insane, then there's a discussion to be had about moving forward.

Or something.

Re: Universal Binary JSON in CouchDB

Posted by Robert Newson <rn...@apache.org>.
Ah, that's easier. If the question is "do we support a binary format
for responses" I'd vote "no".

That doesn't mean we shouldn't improve the speed of what we have, though.

B.

On 4 October 2011 22:43, Riyad Kalla <rk...@gmail.com> wrote:
> Tim's work is certainly the catalyst for my excitement about the potential
> here for Couch.
>
> As Paul pointed out, the correct discussion to have at this point is really
> about "do we support a binary format for responses" and if so "which
> one"? That discussion could go on for an eternity with everyone voting for
> their favorite (protobuff, smile, messagepack, etc.).
>
> The only reason I bring up the "disk store format" discussion into this
> conversation is to offer a hat-tip to a future where a binary response format
> selected now may dovetail nicely with alternative binary disk formats,
> enabling the stream-directly-from-disk scenario.
>
> If we were to hypothetically remove the possibility of the on-disk format
> ever changing, then I suppose the decision of binary response format just
> becomes an issue of "Which one is fast and easy to generate?".
>
> -R
>
> On Tue, Oct 4, 2011 at 12:49 PM, Ladislav Thon <la...@gmail.com> wrote:
>
>> >
>> > That said, the ubjson spec is starting to look reasonable and capable
>> > to be an alternative content-type produced by CouchDB. If someone were
>> > to write a patch I'd review it quite enthusiastically.
>>
>>
>> Just FYI, there's experimental support for MessagePack by Tim Anglade:
>> https://github.com/timanglade/couchdb/commits/msgpack I thought it might
>> be
>> interesting in this debate... Tim says it improves performance quite a bit:
>> http://blog.cloudant.com/optimizing-couchdb-calls-by-99-percent/ (Tim, if
>> you're reading this, thanks for the excellent talk!)
>>
>> LT
>>
>

Re: Universal Binary JSON in CouchDB

Posted by Riyad Kalla <rk...@gmail.com>.
Tim's work is certainly the catalyst for my excitement about the potential
here for Couch.

As Paul pointed out, the correct discussion to have at this point is really
about "do we support a binary format for responses" and if so "which
one"? That discussion could go on for an eternity with everyone voting for
their favorite (protobuff, smile, messagepack, etc.).

The only reason I bring up the "disk store format" discussion into this
conversation is to offer a hat-tip to a future where a binary response format
selected now may dovetail nicely with alternative binary disk formats,
enabling the stream-directly-from-disk scenario.

If we were to hypothetically remove the possibility of the on-disk format
ever changing, then I suppose the decision of binary response format just
becomes an issue of "Which one is fast and easy to generate?".

-R

On Tue, Oct 4, 2011 at 12:49 PM, Ladislav Thon <la...@gmail.com> wrote:

> >
> > That said, the ubjson spec is starting to look reasonable and capable
> > to be an alternative content-type produced by CouchDB. If someone were
> > to write a patch I'd review it quite enthusiastically.
>
>
> Just FYI, there's experimental support for MessagePack by Tim Anglade:
> https://github.com/timanglade/couchdb/commits/msgpack I thought it might
> be
> interesting in this debate... Tim says it improves performance quite a bit:
> http://blog.cloudant.com/optimizing-couchdb-calls-by-99-percent/ (Tim, if
> you're reading this, thanks for the excellent talk!)
>
> LT
>

Re: Universal Binary JSON in CouchDB

Posted by Ladislav Thon <la...@gmail.com>.
>
> That said, the ubjson spec is starting to look reasonable and capable
> to be an alternative content-type produced by CouchDB. If someone were
> to write a patch I'd review it quite enthusiastically.


Just FYI, there's experimental support for MessagePack by Tim Anglade:
https://github.com/timanglade/couchdb/commits/msgpack I thought it might be
interesting in this debate... Tim says it improves performance quite a bit:
http://blog.cloudant.com/optimizing-couchdb-calls-by-99-percent/ (Tim, if
you're reading this, thanks for the excellent talk!)

LT

Re: Universal Binary JSON in CouchDB

Posted by Benoit Chesneau <bc...@gmail.com>.
On Wed, Oct 5, 2011 at 1:34 AM, Paul Davis <pa...@gmail.com> wrote:
> On Tue, Oct 4, 2011 at 3:08 PM, Benoit Chesneau <bc...@gmail.com> wrote:
>> On Tue, Oct 4, 2011 at 9:33 PM, Paul Davis <pa...@gmail.com> wrote:
>>> For a first step I'd prefer to see a patch that makes the HTTP
>>> responses choose a content type based on accept headers. Once we see
>>> what that looks like and how/if it changes performance then *maybe* we
>>> can start talking about on disk formats. Changing how we store things
>>> on disk is a fairly high impact change that we'll need to consider
>>> carefully.
>>
>> +1
>>>
>>> That said, the ubjson spec is starting to look reasonable and capable
>>> to be an alternative content-type produced by CouchDB. If someone were
>>> to write a patch I'd review it quite enthusiastically.
>>>
>>>
>>
>> I think I would prefer to use protobuffs format though. Anyway if we
>> change the api to handle all types that would be pluggable without
>> problem.
>>
>> - benoît
>>
>
> I think you nailed it Benoit. First step, see if we can remove the
> JSON specific stuff (we build JSON strings by hand in some places)
> with an eye on keeping it generic. Then we can start thinking about
> how to make it reasonable for plugging in any type of encoder/decoder
> pair.
>

totally :) mea culpa.

Anyway I think that, in order, I would prefer to work on a more flexible
TCP/HTTP layer that would allow us to provide correct HTTP responses
and handle other types (with accept headers). I'm looking more and
more at cowboy for that. The second part, accepting formats, would work
I think if we can provide a way to standardize the way we save docs
and such on disk. For example we could reuse our object representation,
{[]} etc. (see the example below); that sounds like it would be
feasible. Then we wouldn't have to change the representation on disk,
nor the disk format. But right, step by step.
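
Concretely, by "our object representation" I mean the usual EJSON terms;
e.g. {"name":"bob","age":31} is, internally:

    {[{<<"name">>, <<"bob">>}, {<<"age">>, 31}]}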

- benoît

Re: Universal Binary JSON in CouchDB

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Oct 4, 2011 at 3:08 PM, Benoit Chesneau <bc...@gmail.com> wrote:
> On Tue, Oct 4, 2011 at 9:33 PM, Paul Davis <pa...@gmail.com> wrote:
>> For a first step I'd prefer to see a patch that makes the HTTP
>> responses choose a content type based on accept headers. Once we see
>> what that looks like and how/if it changes performance then *maybe* we
>> can start talking about on disk formats. Changing how we store things
>> on disk is a fairly high impact change that we'll need to consider
>> carefully.
>
> +1
>>
>> That said, the ubjson spec is starting to look reasonable and capable
>> to be an alternative content-type produced by CouchDB. If someone were
>> to write a patch I'd review it quite enthusiastically.
>>
>>
>
> I think I would prefer to use protobuffs format though. Anyway if we
> change the api to handle all types that would be pluggable without
> problem.
>
> - benoît
>

I think you nailed it Benoit. First step, see if we can remove the
JSON specific stuff (we build JSON strings by hand in some places)
with an eye on keeping it generic. Then we can start thinking about
how to make it reasonable for plugging in any type of encoder/decoder
pair.

Re: Universal Binary JSON in CouchDB

Posted by Benoit Chesneau <bc...@gmail.com>.
On Tue, Oct 4, 2011 at 10:18 PM, Robert Newson <rn...@apache.org> wrote:

>
> Bottom line: I'd focus on optimizing the JSON encode/decode layer
> first before considering anything as dramatic as this. Paul Davis
> wrote a very fast JSON encoder/decoder called 'jiffy'. I would like to
> hear more about that.
>
there is a ticket waiting for feedback on that ;)

- benoît

Re: Universal Binary JSON in CouchDB

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Oct 4, 2011 at 4:24 PM, Benoit Chesneau <bc...@gmail.com> wrote:
> On Tue, Oct 4, 2011 at 10:18 PM, Robert Newson <rn...@apache.org> wrote:
>> -1
>>
>> Supporting multiple formats on disk would be a very difficult code
>> change that would complicate every part of the system, I don't think
>> it's worth it.
>>
>>
>
> The real problem I see is maintaining different formats with views and
> replication; it will indeed complicate things. There are also other
> tricks that can be used at the http/tcp level to speed things up, like
> supporting websockets or other things like that.
>
>
> - benoit
>

Yeah, it gets interesting quickly. But that's why I think we should
have the first step of basing things off the accept header for
request/response bodies only. *If* something other than JSON turns
into a noticeable improvement, then we *may* add a replication thing
to use it. But promising to do so right out is, I think, overreaching
a bit.

Plus, if we can't manage to have accept based content-type switching,
why should we think we'd be any better at supporting it internally?

Re: Universal Binary JSON in CouchDB

Posted by Benoit Chesneau <bc...@gmail.com>.
On Tue, Oct 4, 2011 at 10:18 PM, Robert Newson <rn...@apache.org> wrote:
> -1
>
> Supporting multiple formats on disk would be a very difficult code
> change that would complicate every part of the system, I don't think
> it's worth it.
>
>

The real problem I see is maintaining different formats with views and
replication; it will indeed complicate things. There are also other
tricks that can be used at the http/tcp level to speed things up, like
supporting websockets or other things like that.


- benoit

Re: Universal Binary JSON in CouchDB

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Oct 4, 2011 at 3:18 PM, Robert Newson <rn...@apache.org> wrote:
> -1
>

Such a Debbie Downer.

> Supporting multiple formats on disk would be a very difficult code
> change that would complicate every part of the system, I don't think
> it's worth it.
>

It's not necessarily multiple formats, just one that we might be able
to serve (almost) directly to clients. Obviously this is hard and has
quite a few caveats if we decide to change away from Erlang's external
term format. But as it is, ubjson is basically the same thing as
Erlang's external term format, just not Erlang specific.
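
For instance, in an Erlang shell (output elided after the leading tag
bytes):

    1> term_to_binary({[{<<"name">>,<<"bob">>},{<<"age">>,31}]}).
    <<131,104,1,108,0,0,0,2,...>>

Same basic idea as ubjson (type tags plus lengths plus raw bytes, no
scanning), just not portable outside Erlang.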

If there's a possibility of it making a difference I see no reason not
to investigate it. But I maintain that such a change would be quite
large and impact a large portion of the code base. So if there is a
change to be proposed someone will have to champion it, write it, test
it, and then convince everyone else that it's worth it.

> If we were to contemplate just multiple http payload formats, I would
> rather support one with broader acceptance (and with the caveat that
> it would have to have some compelling reason beyond being just another
> format). I'm aware of Tim's work on messagepack but I believe it's run
> aground for the technical reasons I alluded to above.
>

Not sure what the point of the allusion was. MessagePack is nice but
lacks some features that would be required by CouchDB's behaviors. It
was only because Tim suggested MessagePack that I knew to suggest
things like a no-op type and unbounded container lengths.

> Bottom line: I'd focus on optimizing the JSON encode/decode layer
> first before considering anything as dramatic as this. Paul Davis
> wrote a very fast JSON encoder/decoder called 'jiffy'. I would like to
> hear more about that.
>

I have. I think I have a very subtle bug, because I saw a single
segfault once, so I haven't pushed too hard on getting it into trunk
before other people test it.

I think this goes back to Tim's talk though and my initial reaction to
MessagePack. I'm sure that it's probably faster and is definitely
smaller than the corresponding JSON. And I can probably show that by
writing hand-optimized encoder/decoder pairs for both. The issue is
that we can't support an encoder for every client language. So if
there's a reasonable spec that makes it easier for Ada or BrainFuck to
parse efficiently and doesn't upset the internals too greatly, then I
see no reason not to investigate.

> B.
>
> On 4 October 2011 21:08, Benoit Chesneau <bc...@gmail.com> wrote:
>> On Tue, Oct 4, 2011 at 9:33 PM, Paul Davis <pa...@gmail.com> wrote:
>>> For a first step I'd prefer to see a patch that makes the HTTP
>>> responses choose a content type based on accept headers. Once we see
>>> what that looks like and how/if it changes performance then *maybe* we
>>> can start talking about on disk formats. Changing how we store things
>>> on disk is a fairly high impact change that we'll need to consider
>>> carefully.
>>
>> +1
>>>
>>> That said, the ubjson spec is starting to look reasonable and capable
>>> to be an alternative content-type produced by CouchDB. If someone were
>>> to write a patch I'd review it quite enthusiastically.
>>>
>>>
>>
>> I think I would prefer to use protobuffs format though. Anyway if we
>> change the api to handle all types that would be pluggable without
>> problem.
>>
>> - benoît
>>
>

Re: Universal Binary JSON in CouchDB

Posted by Robert Newson <rn...@apache.org>.
-1

Supporting multiple formats on disk would be a very difficult code
change that would complicate every part of the system, I don't think
it's worth it.

If we were to contemplate just multiple http payload formats, I would
rather support one with broader acceptance (and with the caveat that
it would have to have some compelling reason beyond being just another
format). I'm aware of Tim's work on messagepack but I believe it's run
aground for the technical reasons I alluded to above.

Bottom line: I'd focus on optimizing the JSON encode/decode layer
first before considering anything as dramatic as this. Paul Davis
wrote a very fast JSON encoder/decoder called 'jiffy'. I would like to
hear more about that.

B.

On 4 October 2011 21:08, Benoit Chesneau <bc...@gmail.com> wrote:
> On Tue, Oct 4, 2011 at 9:33 PM, Paul Davis <pa...@gmail.com> wrote:
>> For a first step I'd prefer to see a patch that makes the HTTP
>> responses choose a content type based on accept headers. Once we see
>> what that looks like and how/if it changes performance then *maybe* we
>> can start talking about on disk formats. Changing how we store things
>> on disk is a fairly high impact change that we'll need to consider
>> carefully.
>
> +1
>>
>> That said, the ubjson spec is starting to look reasonable and capable
>> to be an alternative content-type produced by CouchDB. If someone were
>> to write a patch I'd review it quite enthusiastically.
>>
>>
>
> I think I would prefer to use protobuffs format though. Anyway if we
> change the api to handle all types that would be pluggable without
> problem.
>
> - benoît
>

Re: Universal Binary JSON in CouchDB

Posted by Benoit Chesneau <bc...@gmail.com>.
On Tue, Oct 4, 2011 at 9:33 PM, Paul Davis <pa...@gmail.com> wrote:
> For a first step I'd prefer to see a patch that makes the HTTP
> responses choose a content type based on accept headers. Once we see
> what that looks like and how/if it changes performance then *maybe* we
> can start talking about on disk formats. Changing how we store things
> on disk is a fairly high impact change that we'll need to consider
> carefully.

+1
>
> That said, the ubjson spec is starting to look reasonable and capable
> to be an alternative content-type produced by CouchDB. If someone were
> to write a patch I'd review it quite enthusiastically.
>
>

I think I would prefer to use protobuffs format though. Anyway if we
change the api to handle all types that would be pluggable without
problem.

- benoît

Re: Universal Binary JSON in CouchDB

Posted by Paul Davis <pa...@gmail.com>.
For a first step I'd prefer to see a patch that makes the HTTP
responses choose a content type based on accept headers. Once we see
what that looks like and how/if it changes performance then *maybe* we
can start talking about on disk formats. Changing how we store things
on disk is a fairly high impact change that we'll need to consider
carefully.
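
A rough sketch of the negotiation I have in mind (the ubjson module is
hypothetical, and real Accept parsing needs q-values, but it shows the
shape):

    %% Pick an encoder for the response body from the Accept header.
    encoder_for(Req) ->
        case couch_httpd:header_value(Req, "Accept", "application/json") of
            "application/ubjson" -> fun ubjson:encode/1;
            _Else -> fun ejson:encode/1
        end.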

That said, the ubjson spec is starting to look reasonable and capable
to be an alternative content-type produced by CouchDB. If someone were
to write a patch I'd review it quite enthusiastically.


On Tue, Oct 4, 2011 at 2:23 PM, Riyad Kalla <rk...@gmail.com> wrote:
> Hey Randall,
>
> This is something that Paul and I discussed on IRC. The way UBJ is written
> out looks something like this ([] blocks are just for readability):
> [o][2]
>  [s][4][name][s][3][bob]
>  [s][3][age][i][31]
>
> Couch can easily prepend or append its own dynamic content in a reply. If it
> wants to insert some information right after the object header, the header
> would need to be stored and manipulated by couch separately.
>
> For example, if I upload the doc above, Couch would want to take that root
> object header of:
> [o][2]
>
> and change it to:
> [o][4]
>
> before storing it because of the additions of _id and _rev. Actually this
> could be as simple as storing a "rootObjectCount" and having couch dynamically
> generate the root every time.
>
> 'o' represents object containers with <= 254 elements (1 byte for length)
> and 'O' represents object containers with up to 2.1 billion elements (4 byte
> int).
>
> If couch did that, any request coming into the server might look like this:
> <- client request
> -- (server loads root object count)
> -> server writes back object header: [o][4]
> -- (server calculates dynamic data)
> -> server writes back dynamic content
> -> server streams raw record data straight off disk to client (no
> deserialization)
> -- (OPT: server calculates dynamic data)
> --> OPT: server streams dynamic data appended
>
> Thoughts?
>
> Best,
> Riyad
>
> P.S.> There is support in the spec for unbounded container types when couch
> doesn't know how much it is streaming back, but that isn't necessary for
> retrieving stored docs (but could be handy when responding to view queries
> and other requests whose length is not known in advance)
>
> On Tue, Oct 4, 2011 at 12:02 PM, Randall Leeds <ra...@gmail.com> wrote:
>
>> Hey,
>>
>> Thanks for this thread.
>>
>> I've been interested in ways to reduce the work from disk to client as
>> well.
>> Unfortunately, the metadata inside the document objects is variable based
>> on
>> query parameters (_attachments, _revisions, _revs_info...) so the server
>> needs to decode the disk binary anyway.
>>
>> I would say this is something we should carefully consider for a 2.0 api. I
>> know that, for simplicity, many people really like having the underscore
>> prefixed attributes mixed in right alongside the document data, but a
>> future
>> API that separated these could really make things fly.
>>
>> -Randall
>>
>> [snip: full quote of the original thread, already shown above]
>

Re: Universal Binary JSON in CouchDB

Posted by Riyad Kalla <rk...@gmail.com>.
Hey Randall,

This is something that Paul and I discussed on IRC. The way UBJ is written
out looks something like this ([] blocks are just for readability):
[o][2]
  [s][4][name][s][3][bob]
  [s][3][age][i][31]

Couch can easily prepend or append its own dynamic content in a reply. If it
wants to insert some information right after the object header, the header
would need to be stored and manipulated by couch separately.

For example, if I upload the doc above, Couch would want to take that root
object header of:
[o][2]

and change it to:
[o][4]

before storing it because of the additions of _id and _rev. Actually this
could be as simple as storing a "rootObjectCount" and having couch dynamically
generate the root every time.

'o' represents object containers with <= 254 elements (1 byte for length)
and 'O' represents object containers with up to 2.1 billion elements (4 byte
int).
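
In Erlang, picking the header is a couple of lines (a sketch, module
plumbing aside):

    %% 'o' + 1-byte count for small objects, 'O' + 4-byte int for big ones.
    object_header(Count) when Count =< 254 ->
        <<$o, Count:8>>;
    object_header(Count) ->
        <<$O, Count:32/big-signed-integer>>.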

If couch did that, any request coming into the server might look like this:
<- client request
-- (server loads root object count)
-> server writes back object header: [o][4]
-- (server calculates dynamic data)
-> server writes back dynamic content
-> server streams raw record data straight off disk to client (no
deserialization)
-- (OPT: server calculates dynamic data)
--> OPT: server streams dynamic data appended
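
As a sketch of that reply path (every function name here is invented;
couch would need something that hands back the stored fields as raw
bytes):

    reply_ubjson(Req, Db, DocId) ->
        %% Raw, still-encoded field data straight from disk, plus its count.
        {ok, Rev, FieldCount, RawFields} = open_doc_raw(Db, DocId),
        Header = object_header(FieldCount + 2),  %% +2 for _id and _rev
        Meta = [encode_field(<<"_id">>, DocId),
                encode_field(<<"_rev">>, Rev)],
        %% iolist straight out, no decode/re-encode step
        send_response(Req, [Header, Meta, RawFields]).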

Thoughts?

Best,
Riyad

P.S.> There is support in the spec for unbounded container types when couch
doesn't know how much it is streaming back, but that isn't necessary for
retrieving stored docs (but could be handy when responding to view queries
and other requests whose length is not known in advance)

On Tue, Oct 4, 2011 at 12:02 PM, Randall Leeds <ra...@gmail.com> wrote:

> Hey,
>
> Thanks for this thread.
>
> I've been interested in ways to reduce the work from disk to client as
> well.
> Unfortunately, the metadata inside the document objects is variable based
> on
> query parameters (_attachments, _revisions, _revs_info...) so the server
> needs to decode the disk binary anyway.
>
> I would say this is something we should carefully consider for a 2.0 api. I
> know that, for simplicity, many people really like having the underscore
> prefixed attributes mixed in right alongside the document data, but a
> future
> API that separated these could really make things fly.
>
> -Randall
>
> [snip: full quote of the original thread, already shown above]
>