Posted to dev@drill.apache.org by Jason Altekruse <al...@gmail.com> on 2015/01/26 22:36:22 UTC

[DISCUSS] Change default json read behavior for numbers

Hello Drillers,

I am currently working on improving the error reporting in the JSON reader
to help users with files that Drill cannot read using the default
configuration today.

As a part of this change I think it may be useful to change the default
behavior for reading numbers in JSON documents. Currently we fail on a
simple case: the reader sees numbers with decimal points and then hits a
value of 0 (or any number without a decimal point) in a later record. The
reason for the current behavior is to allow better precision for files
that contain only integers. The issue, however, is that we currently fail
on the basic case of a mix of integers and decimal numbers. See [1] for
more discussion on this.
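
To make the failure mode concrete, here is a minimal sketch in Python. It is
a toy model, not Drill's reader: the names read_strict and SchemaChangeError
are hypothetical, and the only assumption is that a column's type is fixed by
the first value seen.

```python
import json


class SchemaChangeError(Exception):
    """Raised when a column's inferred type changes between records."""


def read_strict(lines):
    """Toy reader: each column's type is fixed by the first value seen."""
    column_types = {}
    rows = []
    for line_no, line in enumerate(lines, start=1):
        record = json.loads(line)
        for name, value in record.items():
            # Remember the first type seen for this column.
            seen = column_types.setdefault(name, type(value))
            if type(value) is not seen:
                raise SchemaChangeError(
                    f"record {line_no}: column {name!r} was {seen.__name__}, "
                    f"got {type(value).__name__}"
                )
        rows.append(record)
    return rows


# A decimal value followed by a plain 0 trips the strict reader.
mixed = ['{"price": 1.5}', '{"price": 0}']
try:
    read_strict(mixed)
    failed = False
except SchemaChangeError as err:
    failed = True
    message = str(err)
```

In this model the second record fails only because 0 is written without a
decimal point, which is the case described above.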

I propose that we switch the JSON reader to read all numbers as doubles by
default. The reader already contains a workaround that allows lossless
casting to integers and decimal types with some extra computational
overhead using all_text_mode, see more info below. [2]
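
As a rough analogue of the two modes, the sketch below uses the parse_int and
parse_float hooks of Python's standard json module. It only illustrates the
trade-off; it is not how Drill implements all_text_mode, and the sample record
is made up.

```python
import json
from decimal import Decimal

line = '{"id": 9007199254740993, "amount": 0.1, "count": 0}'

# Modeled "all numbers as doubles" default: integers are parsed as floats.
as_doubles = json.loads(line, parse_int=float)

# Modeled text mode: keep the original number text and cast afterwards.
as_text = json.loads(line, parse_int=str, parse_float=str)

# Casting from text is lossless, at the cost of an extra conversion step.
exact_id = int(as_text["id"])
exact_amount = Decimal(as_text["amount"])
```

Reading as doubles accepts the mixed integer and decimal values without a
schema change, while the text path keeps every digit of large integers.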

Please share your thoughts on this change.

[1] https://issues.apache.org/jira/browse/DRILL-1460
[2] https://issues.apache.org/jira/browse/DRILL-2071

-Jason

Re: [DISCUSS] Change default json read behavior for numbers

Posted by Jason Altekruse <al...@gmail.com>.

I think there is a reasonable amount of consensus, so I am going to
incorporate the proposed change into my JSON changes that improve the error
reporting (a better error message on a scalar to complex type schema change,
and directing users to use all_text_mode for any other schema change, e.g.
number->string, bool->number, etc.). I do agree that the default should
change once we support embedded types, but for now I think this will provide
a much better user experience for a small amount of dev effort.

-Jason

On Mon, Jan 26, 2015 at 3:14 PM, Ted Dunning <te...@gmail.com> wrote:

> I think that reading all as doubles is fine as an interim step.  This will
> work for very large numbers, though it has the traditional problems with
> very large financial values. I think that we aren't worried much yet about
> people talking about amounts > $10^17.

Re: [DISCUSS] Change default json read behavior for numbers

Posted by Ted Dunning <te...@gmail.com>.

I think that reading all as doubles is fine as an interim step.  This will
work for very large numbers, though it has the traditional problems with
very large financial values. I think that we aren't worried much yet about
people talking about amounts > $10^17.
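
For reference, a small Python check of how coarse doubles are at that
magnitude. It assumes Python 3.9 or later for math.ulp; the variable names
are only for illustration.

```python
import math

amount = 10**17  # exactly representable as a 64-bit double

# Distance from this value to the next larger double.
gap = math.ulp(float(amount))

# One extra dollar is smaller than the gap, so it is lost on conversion.
one_more = float(amount + 1)
same_as_before = one_more == float(amount)
```

At this size neighbouring doubles are 16 apart, so individual dollars (let
alone cents) can no longer be represented.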



On Mon, Jan 26, 2015 at 5:17 PM, Jacques Nadeau <ja...@apache.org> wrote:

> Writing zero int to a float column should be allowed.  Basically, if we
> found a float previously and then we run across a zero, that should be
> accepted.  This doesn't fix the situation where the first value was zero
> but definitely fixes many situations.  I'm up for a second option to treat
> all numbers as doubles but I'm not in support of it for the default as once
> we finish embedded types, this would be our desired behavior.

Re: [DISCUSS] Change default json read behavior for numbers

Posted by Chris Westin <cw...@maprtech.com>.

JavaScript (and therefore JSON) defines all numbers to be 64-bit floating
point, even if they're written without decimals. So, if someone is writing
JSON, this would be their expectation. I would read them all as doubles.

=> http://www.w3schools.com/js/js_numbers.asp

On Mon, Jan 26, 2015 at 2:17 PM, Jacques Nadeau <ja...@apache.org> wrote:

> Writing zero int to a float column should be allowed.  Basically, if we
> found a float previously and then we run across a zero, that should be
> accepted.  This doesn't fix the situation where the first value was zero
> but definitely fixes many situations.  I'm up for a second option to treat
> all numbers as doubles but I'm not in support of it for the default as once
> we finish embedded types, this would be our desired behavior.

Re: [DISCUSS] Change default json read behavior for numbers

Posted by Jacques Nadeau <ja...@apache.org>.

Writing zero int to a float column should be allowed.  Basically, if we
found a float previously and then we run across a zero, that should be
accepted.  This doesn't fix the situation where the first value was zero
but definitely fixes many situations.  I'm up for a second option to treat
all numbers as doubles but I'm not in support of it for the default as once
we finish embedded types, this would be our desired behavior.
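
A minimal sketch of that rule, again as a toy Python model rather than Drill
code. The function name read_with_promotion is hypothetical, and the model
assumes the first value fixes the column type.

```python
import json


def read_with_promotion(lines, column):
    """Toy model: first value fixes the type; ints may widen to float."""
    column_type = None
    values = []
    for line in lines:
        value = json.loads(line)[column]
        if column_type is None:
            column_type = type(value)
        if column_type is float and type(value) is int:
            value = float(value)  # e.g. 0 becomes 0.0 and is accepted
        elif type(value) is not column_type:
            raise TypeError(
                f"column {column!r} is {column_type.__name__}, "
                f"got {type(value).__name__}"
            )
        values.append(value)
    return values


# Float first, then zero: accepted under the relaxed rule.
float_first = read_with_promotion(['{"x": 1.5}', '{"x": 0}'], "x")

# Zero first, then float: still a schema change in this model.
try:
    read_with_promotion(['{"x": 0}', '{"x": 1.5}'], "x")
    zero_first_ok = True
except TypeError:
    zero_first_ok = False
```

The second call shows the remaining gap: when the first value is a zero, the
column is already an integer column by the time the float arrives.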

On Mon, Jan 26, 2015 at 1:36 PM, Jason Altekruse <al...@gmail.com>
wrote:

> Hello Drillers,
>
> I am currently working on improving the error reporting in the JSON reader
> to help users with files that Drill cannot read using the default
> configuration today.
>
> As a part of this change I think it may be useful to change the default
> behavior for reading numbers in JSON documents. Currently we fail on a
> simple case with reading numbers with decimal points and then hit a value
> of 0 (or any number without a decimal point) in a later record. The reason
> for the current behavior is to allow better precision in the case of files
> with only integers. The issue however is that we currently fail on the
> basic case with a mix of integers and decimal numbers. See [1] for more
> discussion on this.
>
> I propose that we switch the JSON reader to read all numbers as doubles by
> default. The reader already contains a workaround that allows lossless
> casting to integers and decimal types with some extra computational
> overhead using all_text_mode, see more info below. [2]
>
> Please share your thoughts on this change.
>
> [1] https://issues.apache.org/jira/browse/DRILL-1460
> [2] https://issues.apache.org/jira/browse/DRILL-2071
>
> -Jason
>