Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/07/23 00:52:07 UTC

Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Hi,

I keep coming across *quirks* in Drill that are quite time-consuming to
deal with and are now causing mounting concerns.

This last one, though, is far more serious than the previous ones because it
deals with loss of data.

I'm working with a small(ish) dataset of around 1m records (which I'm more
than happy to hand over to replicate this).

The problem goes like this:

   1. with dfs.tmp.`/test.json`
   - containing a structure like this (simplified);
   - 800k x {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
   - 100k x {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}

   2. selecting: select some, t.others from dfs.tmp.`/test.json` as t;
   - returns only this for all the records: "yes", {"other":"true","all":"false","sometimes":"yes"}
   - never returns this: "yes", {"other":"true","all":"false","sometimes":"yes"}

The query never returns this:
"yes", {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}
so the last entries in the file are incorrectly represented.

To make matters a lot worse, the property is completely ignored in:
create table X as select * from dfs.tmp.`/test.json` and the resulting parquet file does not
include it at all.
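
For reference, this is roughly the sequence used to reproduce the Parquet side
of it (the table name test_parquet is just an example):

   USE dfs.tmp;
   CREATE TABLE test_parquet AS SELECT * FROM dfs.tmp.`/test.json`;
   -- the "additional" member never shows up in any row of the new table,
   -- even though ~100k of the source records contain it:
   SELECT t.others FROM dfs.tmp.test_parquet t;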

It looks, to me, like the dynamic schema discovery has stopped looking for
schema changes and is quite set in its ways, so set, in fact, that it's
ignoring data.

I'm guessing that this is potentially affecting more people than me.

I believe I have reproduced this under 1.1 and 1.2-SNAPSHOT.

Regards,
 -Stefan

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Jinfeng Ni <ji...@gmail.com>.
Parth,

You are right. If we put t.others.additional in the select list, in addition to
t.others, then the output is wrong.  The JSON file I used has 2 rows:

{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last
entries only"}}

select `some`, t.others, t.others.additional from dfs.tmp.`test5.json` t ;
+-------+-------------------------------------+--------------------+
| some  |               others                |       EXPR$2       |
+-------+-------------------------------------+--------------------+
| yes   | {}                                  | null               |
| yes   | {"additional":"last entries only"}  | last entries only  |
+-------+-------------------------------------+--------------------+

It looks like t.others and t.others.additional are returned as an empty map and
null for the first row, while only partial fields are returned for the second
row. Neither result is valid.
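
For comparison, the expected output for these two rows would be something like:

+-------+-------------------------------------------------------------------------------------+--------------------+
| some  |                                        others                                      |       EXPR$2       |
+-------+-------------------------------------------------------------------------------------+--------------------+
| yes   | {"other":"true","all":"false","sometimes":"yes"}                                    | null               |
| yes   | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}   | last entries only  |
+-------+-------------------------------------------------------------------------------------+--------------------+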





On Thu, Jul 23, 2015 at 2:36 PM, Parth Chandra <pa...@apache.org> wrote:

> Given the sample rows that Stefan provided, the query -
>
>     select `some`, t.others, t.others.additional from `test.json` t;
> does produce incorrect results -
>
>     | yes  | {"additional":"last entries only"}  | last entries only  |
>
> instead of
>
>     | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  | last entries only  |
>
> Jinfeng, your item #4 is also an issue.
>
>
> I'll log JIRAs for these.
>
> Stefan, thank you for helping us out with catching these bugs. Your efforts
> are really appreciated.
>
> Parth
>
>
>
>
>
> On Thu, Jul 23, 2015 at 2:19 PM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > hi,
> >
> > I can provide you with json file an statements to reproduce it if you
> wish.
> >
> > thank you for looking into this.
> >
> > regards,
> >   -Stefan
> > On Jul 23, 2015 9:03 PM, "Jinfeng Ni" <ji...@gmail.com> wrote:
> >
> > > Hi Stefán,
> > >
> > > Thanks a lot for bringing up this issue, which is really helpful to
> > improve
> > > Drill.
> > >
> > > I tried to re-produce the incorrect issues, and I could re-produce the
> > > missing data issue of CTAS parquet, but I could not re-produce the
> > missing
> > > data issue if I query the JSON file directly.
> > >
> > > Here is how I tried:
> > >
> > > 1. with dfs.tmp.`test.json`
> > >   800k of
> > >
> > {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
> > >   100k of
> > >   {"some":"yes","others":{"other":"true","all":"false","
> > > sometimes":"yes","additional":"last entries only"}}
> > >
> > > 2.  SELECT * from dfs.tmp.`test.json`;
> > > I put the output of the query into a file. Here is part of the result,
> > > shown in vim editor
> > >
> > > 824000
> > >
> > >
> >
> +------+------------------------------------------------------------------------------------+
> > > 824001 | some |                                       others
> > >                         |
> > > 824002
> > >
> > >
> >
> +------+------------------------------------------------------------------------------------+
> > > 824003 | yes  | {"other":"true","all":"false","sometimes":"yes"}
> > >                         |
> > > 824004 | yes  |
> > > {"other":"true","all":"false","sometimes":"yes","additional":"last
> > entries
> > > only"}  |
> > > 824005 | yes  |
> > > {"other":"true","all":"false","sometimes":"yes","additional":"last
> > entries
> > > only"}  |
> > >
> > > The left most number is the line number from vim editor.  The first
> > 824003
> > > lines have rows without the "additional" field, while beyond that each
> > row
> > > contains "additional" field.  The line number 824003 (not 800000) comes
> > > from the fact Drill's SqlLine add the columnName as the header for
> every
> > > hundreds rows (?).
> > >
> > > 3.  SELECT t.`some`, t.`others` from dfs.tmp.`test.json` as t;
> > >
> > > Same result as above.
> > >
> > > 4.  USE dfs.tmp;
> > >      CREATE TABLE testparquet as select * from dfs.tmp.`test.json`;
> > >      SELECT * from dfs.tmp.testparquet;
> > >
> > > This one return the missing data from the generated parquet file.
> > >
> > >
> > >  82400 +------+---------------------------------------------------+
> > >  82401 | some |                      others                       |
> > >  82402 +------+---------------------------------------------------+
> > >  82403 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
> > >  82404 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
> > >  82405 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
> > >
> > >
> > > So, looks like there is a bug in the parquet writer operator, when it
> did
> > > not output the additional field into parquet files, while the query
> > against
> > > the JSON seems to return correct result.
> > >
> > > I just want to confirm whether you see similar behavior on your side.
> > >
> > > Thanks again!
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Jul 23, 2015 at 1:35 PM, Stefán Baxter <
> > stefan@activitystream.com>
> > > wrote:
> > >
> > > > Thank you.
> > > >
> > > >
> > > >
> > > > On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > >
> > > > > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <
> > > > stefan@activitystream.com>
> > > > > wrote:
> > > > >
> > > > > > Someone must review the underlying optimization errors to prevent
> > > this
> > > > > from
> > > > > > happening to others.
> > > > > >
> > > > >
> > > > > Jinfeng and Parth are examining this issue to try to come to a
> deeper
> > > > > understanding.  Not surprisingly, they are a little quiet as they
> do
> > > > this.
> > > > >
> > > > >
> > > > > > JSON data, which is unstructured/schema-free in it's nature can
> not
> > > be
> > > > > > treated as consistent, predictable or monolithic.
> > > > > >
> > > > >
> > > > > Indeed.  And Drill vision is based on *exactly* this thought. Right
> > > now,
> > > > > Drill is still new and does not fulfill all aspects of the vision,
> > but
> > > we
> > > > > are making progress rapidly.
> > > > >
> > > > > Your contributions and comments have been very helpful, btw.
> > > > >
> > > >
> > >
> >
>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Parth Chandra <pa...@apache.org>.
Given the sample rows that Stefan provided, the query -

    select `some`, t.others, t.others.additional from `test.json` t;
does produce incorrect results -

    | yes  | {"additional":"last entries only"}  | last entries only  |

instead of

    | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  | last entries only  |

Jinfeng, your item #4 is also an issue.


I'll log JIRAs for these.

Stefan, thank you for helping us out with catching these bugs. Your efforts
are really appreciated.

Parth





On Thu, Jul 23, 2015 at 2:19 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> hi,
>
> I can provide you with json file an statements to reproduce it if you wish.
>
> thank you for looking into this.
>
> regards,
>   -Stefan
> On Jul 23, 2015 9:03 PM, "Jinfeng Ni" <ji...@gmail.com> wrote:
>
> > Hi Stefán,
> >
> > Thanks a lot for bringing up this issue, which is really helpful to
> improve
> > Drill.
> >
> > I tried to re-produce the incorrect issues, and I could re-produce the
> > missing data issue of CTAS parquet, but I could not re-produce the
> missing
> > data issue if I query the JSON file directly.
> >
> > Here is how I tried:
> >
> > 1. with dfs.tmp.`test.json`
> >   800k of
> >
> {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
> >   100k of
> >   {"some":"yes","others":{"other":"true","all":"false","
> > sometimes":"yes","additional":"last entries only"}}
> >
> > 2.  SELECT * from dfs.tmp.`test.json`;
> > I put the output of the query into a file. Here is part of the result,
> > shown in vim editor
> >
> > 824000
> >
> >
> +------+------------------------------------------------------------------------------------+
> > 824001 | some |                                       others
> >                         |
> > 824002
> >
> >
> +------+------------------------------------------------------------------------------------+
> > 824003 | yes  | {"other":"true","all":"false","sometimes":"yes"}
> >                         |
> > 824004 | yes  |
> > {"other":"true","all":"false","sometimes":"yes","additional":"last
> entries
> > only"}  |
> > 824005 | yes  |
> > {"other":"true","all":"false","sometimes":"yes","additional":"last
> entries
> > only"}  |
> >
> > The left most number is the line number from vim editor.  The first
> 824003
> > lines have rows without the "additional" field, while beyond that each
> row
> > contains "additional" field.  The line number 824003 (not 800000) comes
> > from the fact Drill's SqlLine add the columnName as the header for every
> > hundreds rows (?).
> >
> > 3.  SELECT t.`some`, t.`others` from dfs.tmp.`test.json` as t;
> >
> > Same result as above.
> >
> > 4.  USE dfs.tmp;
> >      CREATE TABLE testparquet as select * from dfs.tmp.`test.json`;
> >      SELECT * from dfs.tmp.testparquet;
> >
> > This one return the missing data from the generated parquet file.
> >
> >
> >  82400 +------+---------------------------------------------------+
> >  82401 | some |                      others                       |
> >  82402 +------+---------------------------------------------------+
> >  82403 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
> >  82404 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
> >  82405 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
> >
> >
> > So, looks like there is a bug in the parquet writer operator, when it did
> > not output the additional field into parquet files, while the query
> against
> > the JSON seems to return correct result.
> >
> > I just want to confirm whether you see similar behavior on your side.
> >
> > Thanks again!
> >
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Jul 23, 2015 at 1:35 PM, Stefán Baxter <
> stefan@activitystream.com>
> > wrote:
> >
> > > Thank you.
> > >
> > >
> > >
> > > On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <
> > > stefan@activitystream.com>
> > > > wrote:
> > > >
> > > > > Someone must review the underlying optimization errors to prevent
> > this
> > > > from
> > > > > happening to others.
> > > > >
> > > >
> > > > Jinfeng and Parth are examining this issue to try to come to a deeper
> > > > understanding.  Not surprisingly, they are a little quiet as they do
> > > this.
> > > >
> > > >
> > > > > JSON data, which is unstructured/schema-free in it's nature can not
> > be
> > > > > treated as consistent, predictable or monolithic.
> > > > >
> > > >
> > > > Indeed.  And Drill vision is based on *exactly* this thought. Right
> > now,
> > > > Drill is still new and does not fulfill all aspects of the vision,
> but
> > we
> > > > are making progress rapidly.
> > > >
> > > > Your contributions and comments have been very helpful, btw.
> > > >
> > >
> >
>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Stefán Baxter <st...@activitystream.com>.
hi,

I can provide you with the json file and statements to reproduce it if you wish.

thank you for looking into this.

regards,
  -Stefan
On Jul 23, 2015 9:03 PM, "Jinfeng Ni" <ji...@gmail.com> wrote:

> Hi Stefán,
>
> Thanks a lot for bringing up this issue, which is really helpful to improve
> Drill.
>
> I tried to re-produce the incorrect issues, and I could re-produce the
> missing data issue of CTAS parquet, but I could not re-produce the missing
> data issue if I query the JSON file directly.
>
> Here is how I tried:
>
> 1. with dfs.tmp.`test.json`
>   800k of
>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
>   100k of
>   {"some":"yes","others":{"other":"true","all":"false","
> sometimes":"yes","additional":"last entries only"}}
>
> 2.  SELECT * from dfs.tmp.`test.json`;
> I put the output of the query into a file. Here is part of the result,
> shown in vim editor
>
> 824000
>
> +------+------------------------------------------------------------------------------------+
> 824001 | some |                                       others
>                         |
> 824002
>
> +------+------------------------------------------------------------------------------------+
> 824003 | yes  | {"other":"true","all":"false","sometimes":"yes"}
>                         |
> 824004 | yes  |
> {"other":"true","all":"false","sometimes":"yes","additional":"last entries
> only"}  |
> 824005 | yes  |
> {"other":"true","all":"false","sometimes":"yes","additional":"last entries
> only"}  |
>
> The left most number is the line number from vim editor.  The first 824003
> lines have rows without the "additional" field, while beyond that each row
> contains "additional" field.  The line number 824003 (not 800000) comes
> from the fact Drill's SqlLine add the columnName as the header for every
> hundreds rows (?).
>
> 3.  SELECT t.`some`, t.`others` from dfs.tmp.`test.json` as t;
>
> Same result as above.
>
> 4.  USE dfs.tmp;
>      CREATE TABLE testparquet as select * from dfs.tmp.`test.json`;
>      SELECT * from dfs.tmp.testparquet;
>
> This one return the missing data from the generated parquet file.
>
>
>  82400 +------+---------------------------------------------------+
>  82401 | some |                      others                       |
>  82402 +------+---------------------------------------------------+
>  82403 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
>  82404 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
>  82405 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
>
>
> So, looks like there is a bug in the parquet writer operator, when it did
> not output the additional field into parquet files, while the query against
> the JSON seems to return correct result.
>
> I just want to confirm whether you see similar behavior on your side.
>
> Thanks again!
>
>
>
>
>
>
>
>
> On Thu, Jul 23, 2015 at 1:35 PM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > Thank you.
> >
> >
> >
> > On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <
> > stefan@activitystream.com>
> > > wrote:
> > >
> > > > Someone must review the underlying optimization errors to prevent
> this
> > > from
> > > > happening to others.
> > > >
> > >
> > > Jinfeng and Parth are examining this issue to try to come to a deeper
> > > understanding.  Not surprisingly, they are a little quiet as they do
> > this.
> > >
> > >
> > > > JSON data, which is unstructured/schema-free in it's nature can not
> be
> > > > treated as consistent, predictable or monolithic.
> > > >
> > >
> > > Indeed.  And Drill vision is based on *exactly* this thought. Right
> now,
> > > Drill is still new and does not fulfill all aspects of the vision, but
> we
> > > are making progress rapidly.
> > >
> > > Your contributions and comments have been very helpful, btw.
> > >
> >
>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Jinfeng Ni <ji...@gmail.com>.
Hi Stefán,

Thanks a lot for bringing up this issue, which is really helpful to improve
Drill.

I tried to reproduce the incorrect results, and I could reproduce the
missing data issue of the CTAS parquet, but I could not reproduce the missing
data issue if I query the JSON file directly.

Here is how I tried:

1. with dfs.tmp.`test.json`
  800k of
   {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
  100k of
  {"some":"yes","others":{"other":"true","all":"false","
sometimes":"yes","additional":"last entries only"}}

2.  SELECT * from dfs.tmp.`test.json`;
I put the output of the query into a file. Here is part of the result,
shown in the vim editor:

824000 +------+------------------------------------------------------------------------------------+
824001 | some |                                       others                                      |
824002 +------+------------------------------------------------------------------------------------+
824003 | yes  | {"other":"true","all":"false","sometimes":"yes"}                                   |
824004 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  |
824005 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  |

The leftmost number is the line number from the vim editor.  The first 824003
lines have rows without the "additional" field, while beyond that each row
contains the "additional" field.  The line number 824003 (not 800000) comes
from the fact that Drill's SqlLine adds the column names as a header every few
hundred rows (?).

3.  SELECT t.`some`, t.`others` from dfs.tmp.`test.json` as t;

Same result as above.

4.  USE dfs.tmp;
     CREATE TABLE testparquet as select * from dfs.tmp.`test.json`;
     SELECT * from dfs.tmp.testparquet;

This one reproduces the missing data issue against the generated parquet file.


 82400 +------+---------------------------------------------------+
 82401 | some |                      others                       |
 82402 +------+---------------------------------------------------+
 82403 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
 82404 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
 82405 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |


So, it looks like there is a bug in the parquet writer operator, which did
not output the "additional" field into the parquet files, while the query
against the JSON file seems to return the correct result.
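
One quick way to confirm this against the generated table (just a sketch,
using the table name from step 4):

  SELECT count(*) FROM dfs.tmp.testparquet t
  WHERE t.others.additional IS NOT NULL;
  -- ~100k would be expected here, but against the parquet files written by
  -- the CTAS above this presumably comes back as 0, since the column was
  -- never written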

I just want to confirm whether you see similar behavior on your side.

Thanks again!








On Thu, Jul 23, 2015 at 1:35 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> Thank you.
>
>
>
> On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <
> stefan@activitystream.com>
> > wrote:
> >
> > > Someone must review the underlying optimization errors to prevent this
> > from
> > > happening to others.
> > >
> >
> > Jinfeng and Parth are examining this issue to try to come to a deeper
> > understanding.  Not surprisingly, they are a little quiet as they do
> this.
> >
> >
> > > JSON data, which is unstructured/schema-free in it's nature can not be
> > > treated as consistent, predictable or monolithic.
> > >
> >
> > Indeed.  And Drill vision is based on *exactly* this thought. Right now,
> > Drill is still new and does not fulfill all aspects of the vision, but we
> > are making progress rapidly.
> >
> > Your contributions and comments have been very helpful, btw.
> >
>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Stefán Baxter <st...@activitystream.com>.
Thank you.



On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning <te...@gmail.com> wrote:

> On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > Someone must review the underlying optimization errors to prevent this
> from
> > happening to others.
> >
>
> Jinfeng and Parth are examining this issue to try to come to a deeper
> understanding.  Not surprisingly, they are a little quiet as they do this.
>
>
> > JSON data, which is unstructured/schema-free in it's nature can not be
> > treated as consistent, predictable or monolithic.
> >
>
> Indeed.  And Drill vision is based on *exactly* this thought. Right now,
> Drill is still new and does not fulfill all aspects of the vision, but we
> are making progress rapidly.
>
> Your contributions and comments have been very helpful, btw.
>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <st...@activitystream.com>
wrote:

> Someone must review the underlying optimization errors to prevent this from
> happening to others.
>

Jinfeng and Parth are examining this issue to try to come to a deeper
understanding.  Not surprisingly, they are a little quiet as they do this.


> JSON data, which is unstructured/schema-free in it's nature can not be
> treated as consistent, predictable or monolithic.
>

Indeed.  And Drill vision is based on *exactly* this thought. Right now,
Drill is still new and does not fulfill all aspects of the vision, but we
are making progress rapidly.

Your contributions and comments have been very helpful, btw.

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Stefán Baxter <st...@activitystream.com>.
Hi Abdel,

Thank you for taking the time to respond. I know my frustration is leaking
through, but that does not mean I don't appreciate everything you and the
Drill team are doing, I do.

I also understand the premise of the optimization but I find it too
restrictive, and it certainly does not fit our data, where our customers are
responsible for part of the data (mixed schema).

Parquet seems to know how to deal with new properties and some systems,
like Spark, know how to deal with changes in (schema) property types over
time.

The main concern is this: Drill is a great attempt to support variable data
sources and evolving/mixed/variable schemas, but it does seem to be doing so
with a lot of restrictions that are counterproductive to that goal.

I have had tests fail because fields are incorrectly assumed to be numbers if
enough of them, up front, are null values, and then the system blows up when a
string finally shows up. I have had string values, containing numbers, blow
up because previous values were "real numbers", without any attempt being
made to convert the value to the "strict representation".

This, in addition to the other problems I have come across (in Jira), has
led me to believe that Drill is great, just as long as all your data has
been sanitized. And that is a shame, because the "Time to value" and
"dynamic schema discovery" promises fall a bit short under those
circumstances. I hope this will mature nicely over time and become more
forgiving and understanding of "data in the wild".

For now I need to be able to change that 1000 limit or eliminate it
altogether; please let me know if that can be done.

Ultimately I fail to understand how a change within a single json file
can ever be considered an "unsupported schema change".

very best,
 -Stefan






On Thu, Jul 23, 2015 at 2:53 PM, Abdel Hakim Deneche <ad...@maprtech.com>
wrote:

> Hi Stefan,
>
> Sorry to hear about your misadventure in Drill land. I will try to give you
> some more informations, but I also have limited knowledge for this specific
> case and other developers will probably jump in to correct me.
>
> When you try to read schema-less data, Drill will first investigate the
> 1000 rows to figure out a schema for your data, then it will use this
> schema for the remaining of the query. This explains why your workaround
> works because Drill was able to "figure out" that extra field, and will
> fill it with null whenever it's missing.
>
> To my knowledge, when the data contains a "schema change" (like in your
> case) you generally get an error message stating that "operator X doesn't
> support schema changes", so I am not sure why you are getting incorrect
> results in this case.
>
> You should definitely fill a JIRA for this and mark it as critical. We try
> to fix cases where a query returns incorrect results as soon as possible.
>
> Thank you again for all the efforts your are putting in Drill
>
>
> On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > The workaround for this was to edit the first line in the json file and
> > fake a value for the "additional" field.
> > That way the optimizer could not decide to ignore it.
> >
> > Someone must review the underlying optimization errors to prevent this
> from
> > happening to others.
> >
> > JSON data, which is unstructured/schema-free in it's nature can not be
> > treated as consistent, predictable or monolithic.
> >
> > I hope this workaround/tip is useful for someone and that someone here
> > cares enough to create a blocker issue in jira.
> > (yeah, 14 hours of staring at this and related issues has left me a bit
> > rude and I know I could focus on appreciating all the effort but that
> will
> > have to wait just a bit longer)
> >
> > - Stefan
> >
> >
> > On Wed, Jul 22, 2015 at 11:01 PM, Stefán Baxter <
> stefan@activitystream.com
> > >
> > wrote:
> >
> > > in addition to this.
> > >
> > > selecting: select some, t.others, t.others.additional from
> > dfs.tmp.`/test.json`
> > > as t;
> > > - returns this: "yes", {"additional":"last entries only"}, "last
> entries
> > > only"
> > >
> > > finding the previously missing value but then ignoring all the other
> > > values of the sub structure.
> > >
> > > - Stefan
> > >
> > > On Wed, Jul 22, 2015 at 10:53 PM, Stefán Baxter <
> > stefan@activitystream.com
> > > > wrote:
> > >
> > >> - never returns this: "yes", {"other":"true","all":"
> > >> false","sometimes":"yes"}
> > >>
> > >> should have been:
> > >>
> > >> - never returns this: "yes", {"other":"true","all":"
> > >> false","sometimes":"yes", "additional":"last entries only"}
> > >>
> > >> Regards,
> > >>  -Stefan
> > >>
> > >> On Wed, Jul 22, 2015 at 10:52 PM, Stefán Baxter <
> > >> stefan@activitystream.com> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I keep coming across *quirks* in Drill that are quite time consuming
> to
> > >>> deal with and are now causing mounting concerns.
> > >>>
> > >>> This last one though is far more serious then the previous ones
> because
> > >>> it deals with loss of data.
> > >>>
> > >>> I'm working with a small(ish) dataset of around 1m records (which I'm
> > >>> more than happy to hand over to replicate this)
> > >>>
> > >>> The problem goes like this:
> > >>>
> > >>>    1. with dfs.tmp.`/test.json`
> > >>>    - containing a structure like this (simplified);
> > >>>    - 800k x
> > >>>
> > {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
> > >>>    - 100k
> > >>>    x
> >
> {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last
> > >>>    entries only"}}
> > >>>
> > >>>    2. selecting: select some, t.others from dfs.tmp.`/test.json` as
> t;
> > >>>    - returns only this for all the records: "yes",
> > >>>    {"other":"true","all":"false","sometimes":"yes"}
> > >>>    - never returns this:
> > >>>    "yes", {"other":"true","all":"false","sometimes":"yes"}
> > >>>
> > >>> The query never returns returns this:
> > >>> "yes",
> > {"other":"true","all":"false","sometimes":"yes","additional":"last
> > >>> entries only"} so the last entries in the file are incorrectly
> > represented.
> > >>>
> > >>> To make matters a lot worse the the property is completely ignored
> in:
> > >>> create X as * from dfs.tmp.`/test.json` and the now parquet file does
> > not
> > >>> include it at all.
> > >>>
> > >>> It looks, to me, that the dynamic schema discovery has stopped
> looking
> > >>> for schema changes and is quite set in it's way, so set in fact, that
> > it's
> > >>> ignoring data.
> > >>>
> > >>> I'm guessing that this is potentially affecting more people than me.
> > >>>
> > >>> I believe I have produced this under 1.1 and 1.2-SNAPSHOT.
> > >>>
> > >>> Regards,
> > >>>  -Stefan
> > >>>
> > >>
> > >>
> > >
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
I don't think Drill is supposed to "ignore" data. My understanding is that
the reader will read the new fields, which will cause a schema change, and,
depending on the query (whether all the operators involved can handle the
schema change or not), the query should either succeed or fail.
My understanding is that Drill will most likely fail rather than display
incorrect results; otherwise it's a bug that needs to be fixed.
Sometimes the reader itself may fail, for example if you have a list of
numbers and the first 1000 values are int: if any value after that is a
double or a string, this will cause the json reader to fail.
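
(As an aside, for that particular numbers-then-string case the session option
store.json.all_text_mode reads every scalar JSON value as text, which
sidesteps the type guess entirely; just a sketch, and note that scalar values
then come back as VARCHAR:)

  ALTER SESSION SET `store.json.all_text_mode` = true;
  -- every scalar JSON value is now read as text, so a late string value
  -- no longer conflicts with earlier numeric values
  SELECT * FROM dfs.tmp.`test.json` LIMIT 5;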

On Thu, Jul 23, 2015 at 9:16 AM, Matt <bs...@gmail.com> wrote:

> On 23 Jul 2015, at 10:53, Abdel Hakim Deneche wrote:
>
>  When you try to read schema-less data, Drill will first investigate the
>> 1000 rows to figure out a schema for your data, then it will use this
>> schema for the remaining of the query.
>>
>
> To clarify, if the JSON schema changes on the 1001st 1MMth record, is
> Drill supposed to report an error, or ignore new data elements and only
> consider those discovered in the first 1000 objects?
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

The only right answer to this question must be to a) "adapt to additional
information" and b) "try the hardest to accommodate changes".
The current behavior must be seen as completely worthless (sorry for the
strong language).

Regards,
 -Stefan

On Thu, Jul 23, 2015 at 4:16 PM, Matt <bs...@gmail.com> wrote:

> On 23 Jul 2015, at 10:53, Abdel Hakim Deneche wrote:
>
>  When you try to read schema-less data, Drill will first investigate the
>> 1000 rows to figure out a schema for your data, then it will use this
>> schema for the remaining of the query.
>>
>
> To clarify, if the JSON schema changes on the 1001st 1MMth record, is
> Drill supposed to report an error, or ignore new data elements and only
> consider those discovered in the first 1000 objects?
>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Matt <bs...@gmail.com>.
On 23 Jul 2015, at 10:53, Abdel Hakim Deneche wrote:

> When you try to read schema-less data, Drill will first investigate 
> the
> 1000 rows to figure out a schema for your data, then it will use this
> schema for the remaining of the query.

To clarify, if the JSON schema changes on the 1001st or 1MMth record, is
Drill supposed to report an error, or ignore new data elements and only
consider those discovered in the first 1000 objects?

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
Hi Stefan,

Sorry to hear about your misadventure in Drill land. I will try to give you
some more information, but I also have limited knowledge of this specific
case, and other developers will probably jump in to correct me.

When you try to read schema-less data, Drill will first investigate the first
1000 rows to figure out a schema for your data, then it will use this
schema for the remainder of the query. This explains why your workaround
works: Drill was able to "figure out" that extra field, and will
fill it with null whenever it's missing.
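
(Concretely, once the extra field is part of the discovered schema, a query
like the following simply shows null for the rows that don't have it; this is
just an illustration:)

  SELECT t.others.additional FROM dfs.tmp.`/test.json` t LIMIT 3;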

To my knowledge, when the data contains a "schema change" (like in your
case) you generally get an error message stating that "operator X doesn't
support schema changes", so I am not sure why you are getting incorrect
results in this case.

You should definitely file a JIRA for this and mark it as critical. We try
to fix cases where a query returns incorrect results as soon as possible.

Thank you again for all the effort you are putting into Drill.


On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi,
>
> The workaround for this was to edit the first line in the json file and
> fake a value for the "additional" field.
> That way the optimizer could not decide to ignore it.
>
> Someone must review the underlying optimization errors to prevent this from
> happening to others.
>
> JSON data, which is unstructured/schema-free in it's nature can not be
> treated as consistent, predictable or monolithic.
>
> I hope this workaround/tip is useful for someone and that someone here
> cares enough to create a blocker issue in jira.
> (yeah, 14 hours of staring at this and related issues has left me a bit
> rude and I know I could focus on appreciating all the effort but that will
> have to wait just a bit longer)
>
> - Stefan
>
>
> On Wed, Jul 22, 2015 at 11:01 PM, Stefán Baxter <stefan@activitystream.com
> >
> wrote:
>
> > in addition to this.
> >
> > selecting: select some, t.others, t.others.additional from
> dfs.tmp.`/test.json`
> > as t;
> > - returns this: "yes", {"additional":"last entries only"}, "last entries
> > only"
> >
> > finding the previously missing value but then ignoring all the other
> > values of the sub structure.
> >
> > - Stefan
> >
> > On Wed, Jul 22, 2015 at 10:53 PM, Stefán Baxter <
> stefan@activitystream.com
> > > wrote:
> >
> >> - never returns this: "yes", {"other":"true","all":"
> >> false","sometimes":"yes"}
> >>
> >> should have been:
> >>
> >> - never returns this: "yes", {"other":"true","all":"
> >> false","sometimes":"yes", "additional":"last entries only"}
> >>
> >> Regards,
> >>  -Stefan
> >>
> >> On Wed, Jul 22, 2015 at 10:52 PM, Stefán Baxter <
> >> stefan@activitystream.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> I keep coming across *quirks* in Drill that are quite time consuming to
> >>> deal with and are now causing mounting concerns.
> >>>
> >>> This last one though is far more serious then the previous ones because
> >>> it deals with loss of data.
> >>>
> >>> I'm working with a small(ish) dataset of around 1m records (which I'm
> >>> more than happy to hand over to replicate this)
> >>>
> >>> The problem goes like this:
> >>>
> >>>    1. with dfs.tmp.`/test.json`
> >>>    - containing a structure like this (simplified);
> >>>    - 800k x
> >>>
> {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
> >>>    - 100k
> >>>    x
> {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last
> >>>    entries only"}}
> >>>
> >>>    2. selecting: select some, t.others from dfs.tmp.`/test.json` as t;
> >>>    - returns only this for all the records: "yes",
> >>>    {"other":"true","all":"false","sometimes":"yes"}
> >>>    - never returns this:
> >>>    "yes", {"other":"true","all":"false","sometimes":"yes"}
> >>>
> >>> The query never returns returns this:
> >>> "yes",
> {"other":"true","all":"false","sometimes":"yes","additional":"last
> >>> entries only"} so the last entries in the file are incorrectly
> represented.
> >>>
> >>> To make matters a lot worse the the property is completely ignored in:
> >>> create X as * from dfs.tmp.`/test.json` and the now parquet file does
> not
> >>> include it at all.
> >>>
> >>> It looks, to me, that the dynamic schema discovery has stopped looking
> >>> for schema changes and is quite set in it's way, so set in fact, that
> it's
> >>> ignoring data.
> >>>
> >>> I'm guessing that this is potentially affecting more people than me.
> >>>
> >>> I believe I have produced this under 1.1 and 1.2-SNAPSHOT.
> >>>
> >>> Regards,
> >>>  -Stefan
> >>>
> >>
> >>
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

The workaround for this was to edit the first line in the json file and
fake a value for the "additional" field.
That way the optimizer could not decide to ignore it.
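
For anyone hitting the same thing, the seeded first line looks roughly like
this (the empty string is just a placeholder value):

  {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":""}}

With the field present in the very first record the reader sees, the rest of
the file comes back as expected.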

Someone must review the underlying optimization errors to prevent this from
happening to others.

JSON data, which is unstructured/schema-free in its nature, cannot be
treated as consistent, predictable or monolithic.

I hope this workaround/tip is useful for someone and that someone here
cares enough to create a blocker issue in jira.
(yeah, 14 hours of staring at this and related issues has left me a bit
rude and I know I could focus on appreciating all the effort but that will
have to wait just a bit longer)

- Stefan


On Wed, Jul 22, 2015 at 11:01 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> in addition to this.
>
> selecting: select some, t.others, t.others.additional from dfs.tmp.`/test.json`
> as t;
> - returns this: "yes", {"additional":"last entries only"}, "last entries
> only"
>
> finding the previously missing value but then ignoring all the other
> values of the sub structure.
>
> - Stefan
>
> On Wed, Jul 22, 2015 at 10:53 PM, Stefán Baxter <stefan@activitystream.com
> > wrote:
>
>> - never returns this: "yes", {"other":"true","all":"
>> false","sometimes":"yes"}
>>
>> should have been:
>>
>> - never returns this: "yes", {"other":"true","all":"
>> false","sometimes":"yes", "additional":"last entries only"}
>>
>> Regards,
>>  -Stefan
>>
>> On Wed, Jul 22, 2015 at 10:52 PM, Stefán Baxter <
>> stefan@activitystream.com> wrote:
>>
>>> Hi,
>>>
>>> I keep coming across *quirks* in Drill that are quite time consuming to
>>> deal with and are now causing mounting concerns.
>>>
>>> This last one though is far more serious then the previous ones because
>>> it deals with loss of data.
>>>
>>> I'm working with a small(ish) dataset of around 1m records (which I'm
>>> more than happy to hand over to replicate this)
>>>
>>> The problem goes like this:
>>>
>>>    1. with dfs.tmp.`/test.json`
>>>    - containing a structure like this (simplified);
>>>    - 800k x
>>>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
>>>    - 100k
>>>    x {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last
>>>    entries only"}}
>>>
>>>    2. selecting: select some, t.others from dfs.tmp.`/test.json` as t;
>>>    - returns only this for all the records: "yes",
>>>    {"other":"true","all":"false","sometimes":"yes"}
>>>    - never returns this:
>>>    "yes", {"other":"true","all":"false","sometimes":"yes"}
>>>
>>> The query never returns returns this:
>>> "yes", {"other":"true","all":"false","sometimes":"yes","additional":"last
>>> entries only"} so the last entries in the file are incorrectly represented.
>>>
>>> To make matters a lot worse the the property is completely ignored in:
>>> create X as * from dfs.tmp.`/test.json` and the now parquet file does not
>>> include it at all.
>>>
>>> It looks, to me, that the dynamic schema discovery has stopped looking
>>> for schema changes and is quite set in it's way, so set in fact, that it's
>>> ignoring data.
>>>
>>> I'm guessing that this is potentially affecting more people than me.
>>>
>>> I believe I have produced this under 1.1 and 1.2-SNAPSHOT.
>>>
>>> Regards,
>>>  -Stefan
>>>
>>
>>
>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Stefán Baxter <st...@activitystream.com>.
in addition to this.

selecting: select some, t.others, t.others.additional from dfs.tmp.`/test.json`
as t;
- returns this: "yes", {"additional":"last entries only"}, "last entries
only"

finding the previously missing value but then ignoring all the other values
of the sub structure.

- Stefan

On Wed, Jul 22, 2015 at 10:53 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> - never returns this: "yes", {"other":"true","all":"
> false","sometimes":"yes"}
>
> should have been:
>
> - never returns this: "yes", {"other":"true","all":"
> false","sometimes":"yes", "additional":"last entries only"}
>
> Regards,
>  -Stefan
>
> On Wed, Jul 22, 2015 at 10:52 PM, Stefán Baxter <stefan@activitystream.com
> > wrote:
>
>> Hi,
>>
>> I keep coming across *quirks* in Drill that are quite time consuming to
>> deal with and are now causing mounting concerns.
>>
>> This last one though is far more serious then the previous ones because
>> it deals with loss of data.
>>
>> I'm working with a small(ish) dataset of around 1m records (which I'm
>> more than happy to hand over to replicate this)
>>
>> The problem goes like this:
>>
>>    1. with dfs.tmp.`/test.json`
>>    - containing a structure like this (simplified);
>>    - 800k x
>>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
>>    - 100k
>>    x {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last
>>    entries only"}}
>>
>>    2. selecting: select some, t.others from dfs.tmp.`/test.json` as t;
>>    - returns only this for all the records: "yes",
>>    {"other":"true","all":"false","sometimes":"yes"}
>>    - never returns this:
>>    "yes", {"other":"true","all":"false","sometimes":"yes"}
>>
>> The query never returns returns this:
>> "yes", {"other":"true","all":"false","sometimes":"yes","additional":"last
>> entries only"} so the last entries in the file are incorrectly represented.
>>
>> To make matters a lot worse the the property is completely ignored in:
>> create X as * from dfs.tmp.`/test.json` and the now parquet file does not
>> include it at all.
>>
>> It looks, to me, that the dynamic schema discovery has stopped looking
>> for schema changes and is quite set in it's way, so set in fact, that it's
>> ignoring data.
>>
>> I'm guessing that this is potentially affecting more people than me.
>>
>> I believe I have produced this under 1.1 and 1.2-SNAPSHOT.
>>
>> Regards,
>>  -Stefan
>>
>
>

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Posted by Stefán Baxter <st...@activitystream.com>.
- never returns this: "yes", {"other":"true","all":"
false","sometimes":"yes"}

should have been:

- never returns this: "yes", {"other":"true","all":"
false","sometimes":"yes", "additional":"last entries only"}

Regards,
 -Stefan

On Wed, Jul 22, 2015 at 10:52 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi,
>
> I keep coming across *quirks* in Drill that are quite time consuming to
> deal with and are now causing mounting concerns.
>
> This last one though is far more serious then the previous ones because it
> deals with loss of data.
>
> I'm working with a small(ish) dataset of around 1m records (which I'm more
> than happy to hand over to replicate this)
>
> The problem goes like this:
>
>    1. with dfs.tmp.`/test.json`
>    - containing a structure like this (simplified);
>    - 800k x
>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
>    - 100k
>    x {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last
>    entries only"}}
>
>    2. selecting: select some, t.others from dfs.tmp.`/test.json` as t;
>    - returns only this for all the records: "yes",
>    {"other":"true","all":"false","sometimes":"yes"}
>    - never returns this:
>    "yes", {"other":"true","all":"false","sometimes":"yes"}
>
> The query never returns returns this:
> "yes", {"other":"true","all":"false","sometimes":"yes","additional":"last
> entries only"} so the last entries in the file are incorrectly represented.
>
> To make matters a lot worse the the property is completely ignored in:
> create X as * from dfs.tmp.`/test.json` and the now parquet file does not
> include it at all.
>
> It looks, to me, that the dynamic schema discovery has stopped looking for
> schema changes and is quite set in it's way, so set in fact, that it's
> ignoring data.
>
> I'm guessing that this is potentially affecting more people than me.
>
> I believe I have produced this under 1.1 and 1.2-SNAPSHOT.
>
> Regards,
>  -Stefan
>