Posted to dev@couchdb.apache.org by Adam Kocoloski <ko...@apache.org> on 2019/05/17 02:28:39 UTC

Numbers in JavaScript, Lucene, and FoundationDB

Hi all, CouchDB has always had a somewhat complicated relationship with numbers. I’d like to dig into that a little bit and see if any changes are warranted, or if we can at least be really clear about exactly how they’re handled going forward.

Most of you are likely aware that JS represents *all* numbers as IEEE 754 double precision floats. This means that any number in a JSON document with more than 15 significant digits is at risk of being corrupted when it passes through the JS engine during a view build, for example. Our current behavior is to let that silent corruption occur and put whatever number comes out of the JS engine into the view, formatting as a double, int64, or bignum based on jiffy’s decoding of the JSON output from the JS code.
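
As a concrete illustration of that precision loss, here is a minimal Python sketch. Python's `float` is also an IEEE 754 double, so `float()` stands in for the coercion a JS engine performs; the sample document and its field name are made up for the example.

```python
import json

# Hypothetical document holding an integer just above 2**53 (16 digits).
doc_text = '{"account_id": 9007199254740993}'

# Python's json module decodes integers exactly (arbitrary precision).
exact = json.loads(doc_text)["account_id"]

# Coercing to a double, as a JS engine would, rounds to the nearest
# representable value; doubles above 2**53 can only hold even integers.
after_js = int(float(exact))

print(exact)              # 9007199254740993
print(after_js)           # 9007199254740992
print(exact == after_js)  # False
```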

On the other hand, FoundationDB’s tuple layer encoding is quite a bit more specific. It has a whole bunch of typecodes for integers of practically arbitrary size (up to 255 bytes), along with codes for 32 bit and 64 bit floating point numbers. The typecodes control the sorting; i.e., integers sort separately from floats.
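
The effect of a leading typecode on ordering can be sketched with a toy encoder. This is not the real `fdb` tuple layer: `toy_pack` is a hypothetical helper that handles only non-negative integers of up to 8 bytes and non-negative doubles, and the typecode values are recalled from the tuple layer specification and should be treated as assumptions.

```python
import struct

INT_ZERO = 0x14  # assumed: typecode for zero; positive ints use 0x14 + byte length
DOUBLE = 0x21    # assumed: typecode for 64-bit floats

def toy_pack(value):
    """Toy encoder for non-negative ints (up to 8 bytes) and non-negative floats."""
    if isinstance(value, int):
        n = (value.bit_length() + 7) // 8
        return bytes([INT_ZERO + n]) + value.to_bytes(n, "big")
    # For non-negative doubles, flipping the sign bit keeps the
    # big-endian bytes in numeric order.
    raw = bytearray(struct.pack(">d", value))
    raw[0] ^= 0x80
    return bytes([DOUBLE]) + bytes(raw)

values = [2.5, 3, 1, 0.5, 100]

# Byte order groups by typecode first: every int precedes every float.
ordered = sorted(values, key=toy_pack)
print(ordered)         # [1, 3, 100, 0.5, 2.5]

# Numeric order interleaves them instead.
print(sorted(values))  # [0.5, 1, 2.5, 3, 100]
```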

We also have the ever-popular Lucene indexes for folks who build CouchDB with the search extension. I don’t have all the details for the number handling in that one handy, but it is another one to keep in mind.

One question that comes up fairly quickly — when a user emits a number as a key in a view, what do we store in FoundationDB? In order to respect CouchDB’s existing collation rules we need to use the same typecode for all numbers. Do we simply treat every number as a double, since they were all coerced into that representation anyway in JS?

But now let’s consider Mango indexes, which don’t suffer from any of JavaScript’s sloppiness around number handling. If we’re to respect CouchDB’s current collation rules we still need a common typecode and sortable binary representation across integers and floats. Do we end up using the IEEE 754 float representation of each number as a “sort key” and storing the original number alongside it?
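
One way to picture the "sort key plus original" idea is the sketch below. It uses a common order-preserving transform on the double's bit pattern (flip the sign bit for non-negative values, flip every bit for negative ones); `double_sort_key` is a hypothetical helper, and nothing here is meant as the actual CouchDB or FoundationDB implementation.

```python
import struct

def double_sort_key(number):
    """Return 8 bytes whose lexicographic order matches numeric order."""
    bits = struct.unpack(">Q", struct.pack(">d", float(number)))[0]
    if bits & (1 << 63):
        bits ^= 0xFFFFFFFFFFFFFFFF  # negative: flip all bits
    else:
        bits ^= 1 << 63             # non-negative: flip only the sign bit
    return struct.pack(">Q", bits)

originals = [10, -2.5, 3.0, 3, -7, 0.1]

# Each row pairs the shared sort key with the untouched original number.
rows = [(double_sort_key(n), n) for n in originals]
rows.sort(key=lambda row: row[0])

in_key_order = [original for _, original in rows]
print(in_key_order)  # [-7, -2.5, 0.1, 3.0, 3, 10]
```

Integers and floats end up in one ordering while `3` and `3.0` keep their original form. They share a sort key, so a tie-breaking rule would still be needed if their relative order matters.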

I feel like this ends up being a rabbit hole, but one where we owe it to our users to thoroughly explore and produce a definitive guide :)

Cheers, Adam

Re: Numbers in JavaScript, Lucene, and FoundationDB

Posted by Garren Smith <ga...@apache.org>.

On Fri, May 17, 2019 at 6:04 AM Paul Davis <pa...@gmail.com> wrote:

> It's late so just a few quick notes here:
>
> Jiffy decodes numbers based on their encoding. I.e., any number that
> includes a decimal point or exponent is decoded as a double while any
> integer is decoded as an integer or bignum depending on size. When
> encoding, jiffy likewise writes 1.0 as "1.0" and 1 as "1". Generally
> speaking this seems to be the least surprising behavior for users.
>
> That said, one recurring concern with JSON and numbers has always
> been money math. Things like "$1 / 3" follow a different
> set of rules than arbitrary floating point arithmetic. CouchDB has a
> long history of telling users that numbers mostly behave like doubles
> given our JavaScript default. Given that, I would expect anyone that
> needs a JSON oriented database that has fancy numerical needs to
> already be paying special attention to their numeric data.
>
> The FoundationDB collation does definitely present new questions given
> that we're forced to implement a strict byte ordering. On the face of
> it I'm more than fine forcing everything to doubles and providing the
> mentioned warning label. I do know that FoundationDB's tuple layer has
> some ¯\_(ツ)_/¯ semantics for "invalid" doubles (-NaN, NaN, -0, other
> oddities I'd never heard of). So there may be caveats to mention there
> as well. However, for the most part I'd say our standard reply of "if
> you care about your numbers to the actual bit representation level, use
> a string representation" is, while maybe not officially official, still
> the best advice given JSON.
>
> That of course ignores the fact that `emit(1, 2)` returns a view row
> of `("1.0", "2.0")` which Adam noted as another whole big thing. On
> that I don't have any amazing thoughts this late at night.
>

To get around the ("1.0", "2.0"), we could look at encoding the keys to get
the correct collation in FDB but then also storing the unencoded keys to
return to the user. We could possibly store the original keys in the value,
but that reduces the space left for map values, or we could store them as a
separate row in FDB. This would fix this problem and also help with storing
any strings for a key.
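
A rough sketch of the two layouts, using a plain Python dict as a stand-in for FDB. All names here (`encode_key`, `emit_in_value`, `emit_separate_row`) are hypothetical, and the key encoding is deliberately simplified to non-negative numbers.

```python
import json
import struct

def encode_key(number):
    # Simplification: big-endian double bytes sort correctly only for
    # non-negative numbers; a real encoder must also handle negatives.
    return struct.pack(">d", float(number))

def emit_in_value(store, doc_id, key, value):
    # Layout A: the original key rides along inside the value.
    store[(encode_key(key), doc_id)] = json.dumps([key, value])

def emit_separate_row(store, doc_id, key, value):
    # Layout B: the original key gets its own row next to the value row.
    store[(encode_key(key), doc_id, "key")] = json.dumps(key)
    store[(encode_key(key), doc_id, "value")] = json.dumps(value)

store_a, store_b = {}, {}
for doc_id, key, value in [("doc-b", 2.5, "x"), ("doc-a", 1, 2)]:
    emit_in_value(store_a, doc_id, key, value)
    emit_separate_row(store_b, doc_id, key, value)

# Rows come back in collation order with the original 1 (not 1.0) intact.
rows_a = [json.loads(store_a[k]) for k in sorted(store_a)]
print(rows_a)  # [[1, 2], [2.5, 'x']]

# Layout B doubles the number of rows but leaves each value untouched.
print(len(store_a), len(store_b))  # 2 4
```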




Re: Numbers in JavaScript, Lucene, and FoundationDB

Posted by Paul Davis <pa...@gmail.com>.

It's late so just a few quick notes here:

Jiffy decodes numbers based on their encoding. I.e., any number that
includes a decimal point or exponent is decoded as a double while any
integer is decoded as an integer or bignum depending on size. When
encoding, jiffy likewise writes 1.0 as "1.0" and 1 as "1". Generally
speaking this seems to be the least surprising behavior for users.
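
For readers without an Erlang shell handy, Python's standard `json` module follows the same convention and can serve as an analogy. This shows Python's behavior, not jiffy's, and the sample numbers are arbitrary.

```python
import json

decoded = json.loads("[1, 1.0, 1e2, 123456789012345678901234567890]")

# A decimal point or exponent yields a float; everything else is an
# integer, including values far too large for 64 bits.
kinds = [type(x).__name__ for x in decoded]
print(kinds)  # ['int', 'float', 'float', 'int']

# The int/float distinction survives re-encoding.
encoded = json.dumps([1, 1.0])
print(encoded)  # [1, 1.0]
```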

That said, one recurring concern with JSON and numbers has always
been money math. Things like "$1 / 3" follow a different
set of rules than arbitrary floating point arithmetic. CouchDB has a
long history of telling users that numbers mostly behave like doubles
given our JavaScript default. Given that, I would expect anyone that
needs a JSON oriented database that has fancy numerical needs to
already be paying special attention to their numeric data.

The FoundationDB collation does definitely present new questions given
that we're forced to implement a strict byte ordering. On the face of
it I'm more than fine forcing everything to doubles and providing the
mentioned warning label. I do know that FoundationDB's tuple layer has
some ¯\_(ツ)_/¯ semantics for "invalid" doubles (-NaN, NaN, -0, other
oddities I'd never heard of). So there may be caveats to mention there
as well. However, for the most part I'd say our standard reply of "if
you care about your numbers to the actual bit representation level, use
a string representation" is, while maybe not officially official, still
the best advice given JSON.
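
To show why those values are awkward under a strict byte ordering, the sketch below applies a generic order-preserving transform directly to raw 64-bit patterns. Whether FoundationDB's tuple layer treats these values in exactly this way is an assumption; `sort_key_from_bits` is a hypothetical helper.

```python
def sort_key_from_bits(bits):
    """Order-preserving transform applied to a raw 64-bit double pattern."""
    if bits >> 63:                       # sign bit set: flip every bit
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits ^ (1 << 63)              # sign bit clear: flip only the sign

patterns = {
    "-NaN": 0xFFF8000000000000,
    "-inf": 0xFFF0000000000000,
    "-0.0": 0x8000000000000000,
    "+0.0": 0x0000000000000000,
    "+inf": 0x7FF0000000000000,
    "+NaN": 0x7FF8000000000000,
}

byte_order = sorted(patterns, key=lambda name: sort_key_from_bits(patterns[name]))
print(byte_order)  # ['-NaN', '-inf', '-0.0', '+0.0', '+inf', '+NaN']

# Numeric comparison disagrees with the byte order in both cases:
print(-0.0 == 0.0)                   # True, yet the two keys differ
print(float("nan") == float("nan"))  # False, yet each NaN gets a fixed slot
```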

That of course ignores the fact that `emit(1, 2)` returns a view row
of `("1.0", "2.0")` which Adam noted as another whole big thing. On
that I don't have any amazing thoughts this late at night.
