You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/07/27 13:59:45 UTC

Type confusion and number formatting exceptions

Hi,

It seems that null values can trigger a column to be treated as a numeric
one, in expressions evaluation, regardless of content or other indicators
and that fields in substructures can affect same-named-fields in parent
structure.
(1.2-SNAPSHOT, parquet files)

I have JSON data that can be reduced to to this:

   - {"occurred_at":"2015-07-26
   08:45:41.234","type":"plan.item.added","dimensions":{"type":null,"dim_type":"Unspecified","category":"Unspecified","sub_category":null}}
   - {"occurred_at":"2015-07-26
   08:45:43.598","type":"plan.item.removed","dimensions":{"type":"Unspecified","dim_type":null,"category":"Unspecified","sub_category":null}}
   - {"occurred_at":"2015-07-26
   08:45:44.241","type":"plan.item.removed","dimensions":{"type":"To
   See","category":"Nature","sub_category":"Waterfalls"}}

* notice the discrepancy in the dimensions structure that the type field is
either called type or dim_type (slightly relevant for the rest of this case)


*1. Query where dimensions are not involved*

select p.type, count(*) from
dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where occurred_at
> '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') group
by p.type;
+--------------------+---------+
|        type        | EXPR$1  |
+--------------------+---------+
| plan.item.removed  | 947     |
| plan.item.added    | 40342   |
+--------------------+---------+
2 rows selected (0.508 seconds)


*2. Same query but involves dimension.type as well*

select p.type, coalesce(p.dimensions.dim_type, p.dimensions.type)
dimensions_type, count(*) from
dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where occurred_at
> '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') group
by p.type, coalesce(p.dimensions.dim_type, p.dimensions.type);

Error: SYSTEM ERROR: NumberFormatException: To See
Fragment 2:0
[Error Id: 4756f549-cc47-43e5-899e-10a11efb60ea on localhost:31010]
(state=,code=0)


I can provide test data if this is not enough to reproduce this bug.

Regards,
 -Stefán

Re: Type confusion and number formatting exceptions

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

Would it not be possible to determine this on first-value or know if a
"unused default type" is being used and then change it once a value
presents it self?

Regards,
 -Stefán

On Wed, Jul 29, 2015 at 1:59 AM, Steven Phillips <sp...@maprtech.com>
wrote:

> This issue is addressed in DRILL-3477.
>
> My proposed solution is to use VarBinary as the default type, as this has
> the lowest precedence in the rules for implicit cast.
>
> Making this change breaks some other things though, so I haven't been able
> to merge the fix yet.
>
> On Tue, Jul 28, 2015 at 2:23 PM, Parth Chandra <pa...@apache.org> wrote:
>
> > Hi Stefan
> >   This is the same old issue: Drill does an initial scan to determine the
> > type of a field. In cases where Drill encounters nulls in the data it
> > defaults to using a Nullable Int as the type (not a good choice perhaps).
> >   This leads to all sorts of issues (most of which you're hitting).
> >   There is an effort to improve this (DRILL-3228) but it will be a while
> > before this work is completed.
> >
> >   In the meantime, I can only suggest a workaround : use as cast around
> > your columns -
> >
> >  select p.type, coalesce( cast(p.dimensions.dim_type as varchar(20)),
> > cast(p.dimensions.type as varchar(20))) dimensions_type, count(*) from
> > `test.json` as p where occurred_at > '2015-07-26' and p.type in
> > ('plan.item.added','plan.item.removed') group by p.type,
> > coalesce(cast(p.dimensions.dim_type as varchar(20)),
> cast(p.dimensions.type
> > as varchar(20)));
> >
> >
> >
> >
> >
> > On Mon, Jul 27, 2015 at 4:59 AM, Stefán Baxter <
> stefan@activitystream.com>
> > wrote:
> >
> > > Hi,
> > >
> > > It seems that null values can trigger a column to be treated as a
> numeric
> > > one, in expressions evaluation, regardless of content or other
> indicators
> > > and that fields in substructures can affect same-named-fields in parent
> > > structure.
> > > (1.2-SNAPSHOT, parquet files)
> > >
> > > I have JSON data that can be reduced to to this:
> > >
> > >    - {"occurred_at":"2015-07-26
> > >
> > >
> >
> 08:45:41.234","type":"plan.item.added","dimensions":{"type":null,"dim_type":"Unspecified","category":"Unspecified","sub_category":null}}
> > >    - {"occurred_at":"2015-07-26
> > >
> > >
> >
> 08:45:43.598","type":"plan.item.removed","dimensions":{"type":"Unspecified","dim_type":null,"category":"Unspecified","sub_category":null}}
> > >    - {"occurred_at":"2015-07-26
> > >    08:45:44.241","type":"plan.item.removed","dimensions":{"type":"To
> > >    See","category":"Nature","sub_category":"Waterfalls"}}
> > >
> > > * notice the discrepancy in the dimensions structure that the type
> field
> > is
> > > either called type or dim_type (slightly relevant for the rest of this
> > > case)
> > >
> > >
> > > *1. Query where dimensions are not involved*
> > >
> > > select p.type, count(*) from
> > > dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where
> > occurred_at
> > > > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed')
> > group
> > > by p.type;
> > > +--------------------+---------+
> > > |        type        | EXPR$1  |
> > > +--------------------+---------+
> > > | plan.item.removed  | 947     |
> > > | plan.item.added    | 40342   |
> > > +--------------------+---------+
> > > 2 rows selected (0.508 seconds)
> > >
> > >
> > > *2. Same query but involves dimension.type as well*
> > >
> > > select p.type, coalesce(p.dimensions.dim_type, p.dimensions.type)
> > > dimensions_type, count(*) from
> > > dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where
> > occurred_at
> > > > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed')
> > group
> > > by p.type, coalesce(p.dimensions.dim_type, p.dimensions.type);
> > >
> > > Error: SYSTEM ERROR: NumberFormatException: To See
> > > Fragment 2:0
> > > [Error Id: 4756f549-cc47-43e5-899e-10a11efb60ea on localhost:31010]
> > > (state=,code=0)
> > >
> > >
> > > I can provide test data if this is not enough to reproduce this bug.
> > >
> > > Regards,
> > >  -Stefán
> > >
> >
>
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>

Re: Type confusion and number formatting exceptions

Posted by Steven Phillips <sp...@maprtech.com>.
This issue is addressed in DRILL-3477.

My proposed solution is to use VarBinary as the default type, as this has
the lowest precedence in the rules for implicit cast.

Making this change breaks some other things though, so I haven't been able
to merge the fix yet.

On Tue, Jul 28, 2015 at 2:23 PM, Parth Chandra <pa...@apache.org> wrote:

> Hi Stefan
>   This is the same old issue: Drill does an initial scan to determine the
> type of a field. In cases where Drill encounters nulls in the data it
> defaults to using a Nullable Int as the type (not a good choice perhaps).
>   This leads to all sorts of issues (most of which you're hitting).
>   There is an effort to improve this (DRILL-3228) but it will be a while
> before this work is completed.
>
>   In the meantime, I can only suggest a workaround : use as cast around
> your columns -
>
>  select p.type, coalesce( cast(p.dimensions.dim_type as varchar(20)),
> cast(p.dimensions.type as varchar(20))) dimensions_type, count(*) from
> `test.json` as p where occurred_at > '2015-07-26' and p.type in
> ('plan.item.added','plan.item.removed') group by p.type,
> coalesce(cast(p.dimensions.dim_type as varchar(20)), cast(p.dimensions.type
> as varchar(20)));
>
>
>
>
>
> On Mon, Jul 27, 2015 at 4:59 AM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > It seems that null values can trigger a column to be treated as a numeric
> > one, in expressions evaluation, regardless of content or other indicators
> > and that fields in substructures can affect same-named-fields in parent
> > structure.
> > (1.2-SNAPSHOT, parquet files)
> >
> > I have JSON data that can be reduced to to this:
> >
> >    - {"occurred_at":"2015-07-26
> >
> >
> 08:45:41.234","type":"plan.item.added","dimensions":{"type":null,"dim_type":"Unspecified","category":"Unspecified","sub_category":null}}
> >    - {"occurred_at":"2015-07-26
> >
> >
> 08:45:43.598","type":"plan.item.removed","dimensions":{"type":"Unspecified","dim_type":null,"category":"Unspecified","sub_category":null}}
> >    - {"occurred_at":"2015-07-26
> >    08:45:44.241","type":"plan.item.removed","dimensions":{"type":"To
> >    See","category":"Nature","sub_category":"Waterfalls"}}
> >
> > * notice the discrepancy in the dimensions structure that the type field
> is
> > either called type or dim_type (slightly relevant for the rest of this
> > case)
> >
> >
> > *1. Query where dimensions are not involved*
> >
> > select p.type, count(*) from
> > dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where
> occurred_at
> > > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed')
> group
> > by p.type;
> > +--------------------+---------+
> > |        type        | EXPR$1  |
> > +--------------------+---------+
> > | plan.item.removed  | 947     |
> > | plan.item.added    | 40342   |
> > +--------------------+---------+
> > 2 rows selected (0.508 seconds)
> >
> >
> > *2. Same query but involves dimension.type as well*
> >
> > select p.type, coalesce(p.dimensions.dim_type, p.dimensions.type)
> > dimensions_type, count(*) from
> > dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where
> occurred_at
> > > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed')
> group
> > by p.type, coalesce(p.dimensions.dim_type, p.dimensions.type);
> >
> > Error: SYSTEM ERROR: NumberFormatException: To See
> > Fragment 2:0
> > [Error Id: 4756f549-cc47-43e5-899e-10a11efb60ea on localhost:31010]
> > (state=,code=0)
> >
> >
> > I can provide test data if this is not enough to reproduce this bug.
> >
> > Regards,
> >  -Stefán
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Type confusion and number formatting exceptions

Posted by Parth Chandra <pa...@apache.org>.
Hi Stefan
  This is the same old issue: Drill does an initial scan to determine the
type of a field. In cases where Drill encounters nulls in the data it
defaults to using a Nullable Int as the type (not a good choice perhaps).
  This leads to all sorts of issues (most of which you're hitting).
  There is an effort to improve this (DRILL-3228) but it will be a while
before this work is completed.

  In the meantime, I can only suggest a workaround : use as cast around
your columns -

 select p.type, coalesce( cast(p.dimensions.dim_type as varchar(20)),
cast(p.dimensions.type as varchar(20))) dimensions_type, count(*) from
`test.json` as p where occurred_at > '2015-07-26' and p.type in
('plan.item.added','plan.item.removed') group by p.type,
coalesce(cast(p.dimensions.dim_type as varchar(20)), cast(p.dimensions.type
as varchar(20)));





On Mon, Jul 27, 2015 at 4:59 AM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi,
>
> It seems that null values can trigger a column to be treated as a numeric
> one, in expressions evaluation, regardless of content or other indicators
> and that fields in substructures can affect same-named-fields in parent
> structure.
> (1.2-SNAPSHOT, parquet files)
>
> I have JSON data that can be reduced to to this:
>
>    - {"occurred_at":"2015-07-26
>
>  08:45:41.234","type":"plan.item.added","dimensions":{"type":null,"dim_type":"Unspecified","category":"Unspecified","sub_category":null}}
>    - {"occurred_at":"2015-07-26
>
>  08:45:43.598","type":"plan.item.removed","dimensions":{"type":"Unspecified","dim_type":null,"category":"Unspecified","sub_category":null}}
>    - {"occurred_at":"2015-07-26
>    08:45:44.241","type":"plan.item.removed","dimensions":{"type":"To
>    See","category":"Nature","sub_category":"Waterfalls"}}
>
> * notice the discrepancy in the dimensions structure that the type field is
> either called type or dim_type (slightly relevant for the rest of this
> case)
>
>
> *1. Query where dimensions are not involved*
>
> select p.type, count(*) from
> dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where occurred_at
> > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') group
> by p.type;
> +--------------------+---------+
> |        type        | EXPR$1  |
> +--------------------+---------+
> | plan.item.removed  | 947     |
> | plan.item.added    | 40342   |
> +--------------------+---------+
> 2 rows selected (0.508 seconds)
>
>
> *2. Same query but involves dimension.type as well*
>
> select p.type, coalesce(p.dimensions.dim_type, p.dimensions.type)
> dimensions_type, count(*) from
> dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where occurred_at
> > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') group
> by p.type, coalesce(p.dimensions.dim_type, p.dimensions.type);
>
> Error: SYSTEM ERROR: NumberFormatException: To See
> Fragment 2:0
> [Error Id: 4756f549-cc47-43e5-899e-10a11efb60ea on localhost:31010]
> (state=,code=0)
>
>
> I can provide test data if this is not enough to reproduce this bug.
>
> Regards,
>  -Stefán
>