Posted to dev@drill.apache.org by Jason Altekruse <al...@gmail.com> on 2014/08/07 21:30:19 UTC

Handling numeric types in JSON

Hello Drillers,

I have been taking a look at our open JIRAs for the JSON reader and trying
to figure out the best way to improve the usability of Drill without adding
too much complexity to the reader. I have opened new issues for
general-purpose schema change handling, which we have moved out beyond the
next few releases. While we had previously handled schema changes in JSON
by simply starting a new batch, this gets to be too costly when there are a
lot of columns that can all change schema. The feature also isn't very
useful today, as many of the operators currently either fail on or
incorrectly handle schema changes. The planned solution is a new Embedded
Vector type that will allow individual elements in a vector to have their
own type, rather than enforcing one type for the whole vector. However,
this will significantly complicate code generation and other parts of
Drill.

The current JSON reader implementation makes use of the FieldWriter,
MapWriter, and ListWriter interfaces that were added to structure the
addition of nested data support for operators that need to write into
complex vectors. Unfortunately, these interfaces are designed to write a
single type throughout their existence, and they currently fail even in
simple cases we should be handling. For example, a field holding a number
without a decimal point (currently read as a BigInt in Drill) followed by
one with a decimal point (currently read as a Float8 in Drill) will fail:

{
      "field_1" : 1
}
{
      "field_1" : 5.2
}
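To make the failure concrete, here is a minimal sketch in plain Jackson
(the parser our JSON reader is built on; this is illustration only, not
reader code). The token stream itself distinguishes the two values, but a
writer that locks its type on the first token has nowhere to put the
second:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class NumericTokens {
    public static void main(String[] args) throws Exception {
        String input = "{ \"field_1\" : 1 } { \"field_1\" : 5.2 }";
        JsonParser parser = new JsonFactory().createParser(input);
        JsonToken token;
        while ((token = parser.nextToken()) != null) {
            // Jackson reports integral and fractional literals as two
            // different token types.
            if (token == JsonToken.VALUE_NUMBER_INT) {
                System.out.println("int:   " + parser.getLongValue());
            } else if (token == JsonToken.VALUE_NUMBER_FLOAT) {
                System.out.println("float: " + parser.getDoubleValue());
            }
        }
    }
}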

This should not be considered a schema change, but it is also not a problem
with a trivial solution. Where we see a mix of numeric types we do not want
to lose precision, but there is actually no way to guarantee that entirely.
The JSON spec is intentionally vague, allowing an arbitrary number of
digits before or after the decimal point in a number. Without resorting to
an extremely inefficient type like BigDecimal (which, as far as I know, can
only be created from or exported as a string), we cannot support unbounded
precision. This leaves us with the question of which use cases are going to
be common among users, and how they will expect Drill to act.
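To put a number on the precision trade-off, here is a small illustration
(plain Java, nothing Drill-specific): a Float8/double carries a 53-bit
significand, so integral values above 2^53 silently lose their low-order
digits, while BigDecimal keeps every digit of the literal at the cost of
going through strings:

import java.math.BigDecimal;

public class PrecisionLoss {
    public static void main(String[] args) {
        // 2^53 + 1 is the first positive long a double cannot represent.
        long big = 9007199254740993L;
        double asFloat8 = (double) big;
        System.out.println(big);             // 9007199254740993
        System.out.println((long) asFloat8); // 9007199254740992 -- low bit lost

        // BigDecimal round-trips the literal exactly, via strings.
        BigDecimal exact = new BigDecimal("9007199254740993");
        System.out.println(exact.toPlainString()); // 9007199254740993
    }
}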

There are a few proposed solutions for the time being:
1. read all numbers as Float8, accepting precision loss with extremely
large or small values
2. read all numbers as decimal, although we would need to decide somehow
where to place the decimal point; we can probe the dataset, but that will
always just be a guess
3. have an 'all text' mode for JSON that at least lets users get all of
their data into the Drill engine. From there they can use string functions
and case statements to identify the scale/type of their individual values
and handle them how they wish (see the sketch after this list), although
this will make queries very messy
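For option 3 the reader-side change is small. A rough sketch, reusing the
token loop from above (again plain Jackson, not actual reader code): every
scalar is captured as its literal text, and all typing is deferred to the
query.

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class AllTextMode {
    public static void main(String[] args) throws Exception {
        String input = "{ \"field_1\" : 1 } { \"field_1\" : 5.2 }";
        JsonParser parser = new JsonFactory().createParser(input);
        JsonToken token;
        while ((token = parser.nextToken()) != null) {
            if (token.isScalarValue()) {
                // getText() returns the literal characters of the value,
                // so 1 and 5.2 both land in a varchar-style column and
                // get cast by the user at query time.
                System.out.println("text: " + parser.getText());
            }
        }
    }
}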

Also, we need not enforce a single type for all numbers in a file up front;
we can attempt to read a column as an integer, long, or a
particular-precision decimal representation, and if a wider value shows up
later, go back and rewrite the data that was read earlier.
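As a rough illustration of that promote-and-rewrite idea (plain Java again;
the class and method names here are made up for the example and do not
correspond to our vector code):

import java.util.ArrayList;
import java.util.List;

// Sketch: accumulate a column as longs until a fractional value appears,
// then promote everything read so far to double.
public class PromotingColumn {
    private final List<Long> longs = new ArrayList<>();
    private List<Double> doubles; // null until promotion happens

    private static boolean isIntegral(String literal) {
        // Treat anything with a decimal point or exponent as fractional.
        return literal.indexOf('.') < 0
                && literal.indexOf('e') < 0
                && literal.indexOf('E') < 0;
    }

    public void add(String literal) {
        if (doubles == null && isIntegral(literal)) {
            longs.add(Long.parseLong(literal)); // still in integer mode
        } else {
            if (doubles == null) {
                // First fractional value: rewrite earlier data as doubles.
                doubles = new ArrayList<>();
                for (long v : longs) {
                    doubles.add((double) v);
                }
                longs.clear();
            }
            doubles.add(Double.parseDouble(literal));
        }
    }

    public List<?> values() {
        return doubles == null ? longs : doubles;
    }

    public static void main(String[] args) {
        PromotingColumn field1 = new PromotingColumn();
        field1.add("1");   // read as long
        field1.add("5.2"); // triggers promotion; the 1 is rewritten as 1.0
        System.out.println(field1.values()); // [1.0, 5.2]
    }
}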

What are your thoughts?