Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2015/11/05 15:48:18 UTC

Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Hey all, to facilitate loading of some data from JSON to Parquet, I am
loading it into "day"-based directories...

parqtable
|
|_______2015-11-01
|
|_______2015-11-02
|
|_______2015-11-03
|
|_______2015-11-04

That way I can do select * from `parqtable` where dir0 = '2015-11-01' and
other cool tricks. It also helps my data loading.
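
For example, a query along these lines (the dates here are just for
illustration) should only need to read the matching day directories:

select field1, field2 from `parqtable`
where dir0 in ('2015-11-01', '2015-11-02')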

I am using the exact same query to load each day.

CREATE TABLE `parqtable/2015-11-01` as
(select field1, field2, field3, field4 from jsontable where dir0 =
'2015-11-01')

CREATE TABLE `parqtable/2015-11-02` as
(select field1, field2, field3, field4 from jsontable where dir0 =
'2015-11-02')

CREATE TABLE `parqtable/2015-11-03` as
(select field1, field2, field3, field4 from jsontable where dir0 =
'2015-11-03')

Etc

This seems to work well except for one thing:

If I want to see the count per directory, this (what I thought was obvious)
query:

select dir0, count(*) from `parqtable` group by dir0

fails with

Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema
changes

Fragment: 2:8


I am not sure why this would be the case; the data is loaded by the same
query, so I would assume the schema is the same.

Thoughts on how to troubleshoot?

Thanks!

John

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Posted by John Omernik <jo...@omernik.com>.
Is there a way to see what Drill thinks the schema is for a group of files
(a "show schema" or something similar)?

Thanks


On Thu, Nov 5, 2015 at 10:34 AM, Jason Altekruse <al...@gmail.com>
wrote:

> It is possible to introduce a schema change when you are writing parquet
> from JSON files. If you read part of the JSON that only contains some
> fields, and later see a new field, after we have written a few batches of
> data to the parquet file, we will close the current file and open a new
> one. Parquet currently requires that all columns be known when you start
> writing.
>
> Unfortunately this means that when you query the directory you would see
> schema changes. This is my best guess as to why you are seeing this
> behavior. We have a number of tasks outstanding around handling schema
> change in different parts of the engine, so we will be working to remove
> this restriction. If you need a workaround today, you could try to figure
> out what the complete schema in your JSON will be and just put one record
> at the top of each file that has data in each field that could possibly
> appear in the dataset later. This will tell all of the JSON readers the
> full schema at the start. There currently is no way to provide this
> information to the reader without putting it in the JSON data itself.
>
> On Thu, Nov 5, 2015 at 8:16 AM, John Omernik <jo...@omernik.com> wrote:
>
> > I am getting the same error as before.  I am using the same query to
> create
> > my tables, so I am unsure why it would see the schema as different. Is
> > there way to view the schema for each directory to see what Drill thinks
> is
> > happening?
> >
> > On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> > > Have you tried to create a view on top of parqtable, and then use the
> > view?
> > >
> > > I did a quick experiment that may help you.
> > >
> > > Created a dir structure as follows (using csv files - what I had on
> hand)
> > > orders
> > > |____month1
> > > |____month2
> > > .
> > > .
> > > |____month9
> > >
> > > create or replace view dfs.views.ordersdir as select dir0, * from
> > > dfs.data.`/orders`;
> > >
> > > 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
> > > +---------+-------------------------------------------------------------------+---------+
> > > |  dir0   |                              columns                              |  dir00  |
> > > +---------+-------------------------------------------------------------------+---------+
> > > | month1  | ["8000","January","2014-01-02 23:06:07","15349","ri","203","26"]  | month1  |
> > > | month1  | ["8001","January","2014-01-31 03:40:21","16033","oh","113","23"]  | month1  |
> > > +---------+-------------------------------------------------------------------+---------+
> > >
> > > (interesting to note dir00 in the output)
> > >
> > > But the group by on the dir works as it is a column defined in the view
> > >
> > > 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by dir0;
> > > +---------+---------+
> > > |  dir0   | EXPR$1  |
> > > +---------+---------+
> > > | month1  | 9000    |
> > > | month6  | 16000   |
> > > | month9  | 15131   |
> > > | month2  | 12000   |
> > > | month3  | 16000   |
> > > | month7  | 15573   |
> > > | month5  | 16000   |
> > > | month4  | 12000   |
> > > | month8  | 15390   |
> > > +---------+---------+
> > >
> > > See if that works in your case.
> > >
> > >
> > > —Andries
> > >
> > >
> > > > On Nov 5, 2015, at 6:48 AM, John Omernik <jo...@omernik.com> wrote:
> > > >
> > > > Hey all,  to facilitate loading of some data from JSON to Parquet, I
> am
> > > > using the the load into "day" based directories...
> > > >
> > > > parqtable
> > > > |
> > > > |_______2015-11-01
> > > > |
> > > > |_______2015-11-02
> > > > |
> > > > |_______2015-11-03
> > > > |
> > > > |_______2015-11-04
> > > >
> > > > That way I can do select * from `parqtable` where dir0 = '2015-11-01'
> > and
> > > > other cool tricks. It also helps my data loading.
> > > >
> > > > I am using the exact same query to load each day.
> > > >
> > > > CREATE TABLE `parqtable/2015-11-01' as
> > > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > > '2015-11-01')
> > > >
> > > > CREATE TABLE `parqtable/2015-11-02' as
> > > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > > '2015-11-02')
> > > >
> > > > CREATE TABLE `parqtable/2015-11-03' as
> > > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > > '2015-11-03')
> > > >
> > > > Etc
> > > >
> > > > This seams to work well except for one thing:
> > > >
> > > > If I want to see the count per directory, this (what I thought was
> > > obvious)
> > > > query:
> > > >
> > > > select dir0, count(*) from `parqtable` group by dir0
> > > >
> > > > fails with
> > > >
> > > > Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
> > > schema
> > > > changes
> > > >
> > > > Fragment: 2:8
> > > >
> > > >
> > > > I am not sure why this would be the case, the data is loaded by the
> > same
> > > > query, I would assume the schema is the same....
> > > >
> > > > Thoughts on how to troubleshoot?
> > > >
> > > > Thanks!
> > > >
> > > > John
> > >
> > >
> >
>

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Posted by Jason Altekruse <al...@gmail.com>.
It is possible to introduce a schema change when you are writing Parquet
from JSON files. If the JSON read so far only contains some of the fields
and a new field appears later, after we have already written a few batches
of data to the Parquet file, we will close the current file and open a new
one. Parquet currently requires that all columns be known when you start
writing.

Unfortunately this means that when you query the directory you would see
schema changes. This is my best guess as to why you are seeing this
behavior. We have a number of tasks outstanding around handling schema
change in different parts of the engine, so we will be working to remove
this restriction. If you need a workaround today, you could try to figure
out what the complete schema in your JSON will be and just put one record
at the top of each file that has data in each field that could possibly
appear in the dataset later. This will tell all of the JSON readers the
full schema at the start. There currently is no way to provide this
information to the reader without putting it in the JSON data itself.
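
A related thing to try on the CTAS side (this is not the seeding approach
above, just a way to pin the output schema; the varchar/bigint types below
are only placeholders for whatever field1..field4 really are) is to cast
every column explicitly so each day's Parquet files come out with the same
column types:

CREATE TABLE `parqtable/2015-11-01` as
(select cast(field1 as varchar(255)) as field1,
        cast(field2 as varchar(255)) as field2,
        cast(field3 as bigint)       as field3,
        cast(field4 as bigint)       as field4
 from jsontable where dir0 = '2015-11-01')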

On Thu, Nov 5, 2015 at 8:16 AM, John Omernik <jo...@omernik.com> wrote:

> I am getting the same error as before.  I am using the same query to create
> my tables, so I am unsure why it would see the schema as different. Is
> there way to view the schema for each directory to see what Drill thinks is
> happening?
>
> On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > Have you tried to create a view on top of parqtable, and then use the
> view?
> >
> > I did a quick experiment that may help you.
> >
> > Created a dir structure as follows (using csv files - what I had on hand)
> > orders
> > |____month1
> > |____month2
> > .
> > .
> > |____month9
> >
> > create or replace view dfs.views.ordersdir as select dir0, * from
> > dfs.data.`/orders`;
> >
> > 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
> > +---------+-------------------------------------------------------------------+---------+
> > |  dir0   |                              columns                              |  dir00  |
> > +---------+-------------------------------------------------------------------+---------+
> > | month1  | ["8000","January","2014-01-02 23:06:07","15349","ri","203","26"]  | month1  |
> > | month1  | ["8001","January","2014-01-31 03:40:21","16033","oh","113","23"]  | month1  |
> > +---------+-------------------------------------------------------------------+---------+
> >
> > (interesting to note dir00 in the output)
> >
> > But the group by on the dir works as it is a column defined in the view
> >
> > 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by dir0;
> > +---------+---------+
> > |  dir0   | EXPR$1  |
> > +---------+---------+
> > | month1  | 9000    |
> > | month6  | 16000   |
> > | month9  | 15131   |
> > | month2  | 12000   |
> > | month3  | 16000   |
> > | month7  | 15573   |
> > | month5  | 16000   |
> > | month4  | 12000   |
> > | month8  | 15390   |
> > +---------+---------+
> >
> > See if that works in your case.
> >
> >
> > —Andries
> >
> >
> > > On Nov 5, 2015, at 6:48 AM, John Omernik <jo...@omernik.com> wrote:
> > >
> > > Hey all,  to facilitate loading of some data from JSON to Parquet, I am
> > > using the the load into "day" based directories...
> > >
> > > parqtable
> > > |
> > > |_______2015-11-01
> > > |
> > > |_______2015-11-02
> > > |
> > > |_______2015-11-03
> > > |
> > > |_______2015-11-04
> > >
> > > That way I can do select * from `parqtable` where dir0 = '2015-11-01'
> and
> > > other cool tricks. It also helps my data loading.
> > >
> > > I am using the exact same query to load each day.
> > >
> > > CREATE TABLE `parqtable/2015-11-01' as
> > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > '2015-11-01')
> > >
> > > CREATE TABLE `parqtable/2015-11-02' as
> > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > '2015-11-02')
> > >
> > > CREATE TABLE `parqtable/2015-11-03' as
> > > (select field1, field2, field3, field4 from jsontable where dir0 =
> > > '2015-11-03')
> > >
> > > Etc
> > >
> > > This seams to work well except for one thing:
> > >
> > > If I want to see the count per directory, this (what I thought was
> > obvious)
> > > query:
> > >
> > > select dir0, count(*) from `parqtable` group by dir0
> > >
> > > fails with
> > >
> > > Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
> > schema
> > > changes
> > >
> > > Fragment: 2:8
> > >
> > >
> > > I am not sure why this would be the case, the data is loaded by the
> same
> > > query, I would assume the schema is the same....
> > >
> > > Thoughts on how to troubleshoot?
> > >
> > > Thanks!
> > >
> > > John
> >
> >
>

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Posted by John Omernik <jo...@omernik.com>.
This works:

select dir0, count(1) from table group by dir0

This does not:

select dir0, count(*) from table group by dir0

Thanks for the tip on count(1)

John


On Thu, Nov 5, 2015 at 7:40 PM, Jacques Nadeau <ja...@dremio.com> wrote:

> In general, * is more complicated. However, count(*) is actually rewritten
> as count(1) and is generally the most efficient way to do a count. (Test
> reader might have a bug here which makes take longer (read all data instead
> of just row count).
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Nov 5, 2015 at 8:43 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > You can query each directory individually to see which one is
> troublesome.
> >
> > Also till the changes Jason is discussing is made to handle schema change
> > better I would suggest that you avoid * operations as a general rule of
> > thumb, unless needed.
> >
> > Since Drill has to manage schema on the fly * operations are a lot more
> > involved than a traditional RDBMS where schema is known and the optimizer
> > can easily manage it. A simple elect count(*) on a traditional RDBMS with
> > stats is a very simple operation, where in Drill it can be a challenge as
> > the schema is not necessarily known ahead of time.
> >
> > For this reason I normally suggest that people use select count() only on
> > a single field that is a simple data type and known not to change.
> > Typically a key or id field is the best to use, this way you sidestep
> > schema challenges and still get the result you want.
> >
> > select dir0, count(id_field) from parqtable group by dir0; will be a much
> > better option.
> >
> >
> > —Andries
> >
> > > On Nov 5, 2015, at 8:16 AM, John Omernik <jo...@omernik.com> wrote:
> > >
> > > I am getting the same error as before.  I am using the same query to
> > create
> > > my tables, so I am unsure why it would see the schema as different. Is
> > > there way to view the schema for each directory to see what Drill
> thinks
> > is
> > > happening?
> > >
> > > On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> > > aengelbrecht@maprtech.com> wrote:
> > >
> > >> Have you tried to create a view on top of parqtable, and then use the
> > view?
> > >>
> > >> I did a quick experiment that may help you.
> > >>
> > >> Created a dir structure as follows (using csv files - what I had on
> > hand)
> > >> orders
> > >> |____month1
> > >> |____month2
> > >> .
> > >> .
> > >> |____month9
> > >>
> > >> create or replace view dfs.views.ordersdir as select dir0, * from
> > >> dfs.data.`/orders`;
> > >>
> > >> 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
> > >> +---------+-------------------------------------------------------------------+---------+
> > >> |  dir0   |                              columns                              |  dir00  |
> > >> +---------+-------------------------------------------------------------------+---------+
> > >> | month1  | ["8000","January","2014-01-02 23:06:07","15349","ri","203","26"]  | month1  |
> > >> | month1  | ["8001","January","2014-01-31 03:40:21","16033","oh","113","23"]  | month1  |
> > >> +---------+-------------------------------------------------------------------+---------+
> > >>
> > >> (interesting to note dir00 in the output)
> > >>
> > >> But the group by on the dir works as it is a column defined in the view
> > >>
> > >> 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by dir0;
> > >> +---------+---------+
> > >> |  dir0   | EXPR$1  |
> > >> +---------+---------+
> > >> | month1  | 9000    |
> > >> | month6  | 16000   |
> > >> | month9  | 15131   |
> > >> | month2  | 12000   |
> > >> | month3  | 16000   |
> > >> | month7  | 15573   |
> > >> | month5  | 16000   |
> > >> | month4  | 12000   |
> > >> | month8  | 15390   |
> > >> +---------+---------+
> > >>
> > >> See if that works in your case.
> > >>
> > >>
> > >> —Andries
> > >>
> > >>
> > >>> On Nov 5, 2015, at 6:48 AM, John Omernik <jo...@omernik.com> wrote:
> > >>>
> > >>> Hey all,  to facilitate loading of some data from JSON to Parquet, I
> am
> > >>> using the the load into "day" based directories...
> > >>>
> > >>> parqtable
> > >>> |
> > >>> |_______2015-11-01
> > >>> |
> > >>> |_______2015-11-02
> > >>> |
> > >>> |_______2015-11-03
> > >>> |
> > >>> |_______2015-11-04
> > >>>
> > >>> That way I can do select * from `parqtable` where dir0 = '2015-11-01'
> > and
> > >>> other cool tricks. It also helps my data loading.
> > >>>
> > >>> I am using the exact same query to load each day.
> > >>>
> > >>> CREATE TABLE `parqtable/2015-11-01' as
> > >>> (select field1, field2, field3, field4 from jsontable where dir0 =
> > >>> '2015-11-01')
> > >>>
> > >>> CREATE TABLE `parqtable/2015-11-02' as
> > >>> (select field1, field2, field3, field4 from jsontable where dir0 =
> > >>> '2015-11-02')
> > >>>
> > >>> CREATE TABLE `parqtable/2015-11-03' as
> > >>> (select field1, field2, field3, field4 from jsontable where dir0 =
> > >>> '2015-11-03')
> > >>>
> > >>> Etc
> > >>>
> > >>> This seams to work well except for one thing:
> > >>>
> > >>> If I want to see the count per directory, this (what I thought was
> > >> obvious)
> > >>> query:
> > >>>
> > >>> select dir0, count(*) from `parqtable` group by dir0
> > >>>
> > >>> fails with
> > >>>
> > >>> Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
> > >> schema
> > >>> changes
> > >>>
> > >>> Fragment: 2:8
> > >>>
> > >>>
> > >>> I am not sure why this would be the case, the data is loaded by the
> > same
> > >>> query, I would assume the schema is the same....
> > >>>
> > >>> Thoughts on how to troubleshoot?
> > >>>
> > >>> Thanks!
> > >>>
> > >>> John
> > >>
> > >>
> >
> >
>

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Posted by Jacques Nadeau <ja...@dremio.com>.
In general, * is more complicated. However, count(*) is actually rewritten
as count(1) and is generally the most efficient way to do a count. (The
reader might have a bug here which makes it take longer by reading all of
the data instead of just the row count.)
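
If you want to see what the planner actually does with the two forms,
comparing the plans should show whether the rewrite happens and which
aggregate operator gets picked:

explain plan for select dir0, count(*) from `parqtable` group by dir0;
explain plan for select dir0, count(1) from `parqtable` group by dir0;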

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Nov 5, 2015 at 8:43 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> You can query each directory individually to see which one is troublesome.
>
> Also till the changes Jason is discussing is made to handle schema change
> better I would suggest that you avoid * operations as a general rule of
> thumb, unless needed.
>
> Since Drill has to manage schema on the fly * operations are a lot more
> involved than a traditional RDBMS where schema is known and the optimizer
> can easily manage it. A simple elect count(*) on a traditional RDBMS with
> stats is a very simple operation, where in Drill it can be a challenge as
> the schema is not necessarily known ahead of time.
>
> For this reason I normally suggest that people use select count() only on
> a single field that is a simple data type and known not to change.
> Typically a key or id field is the best to use, this way you sidestep
> schema challenges and still get the result you want.
>
> select dir0, count(id_field) from parqtable group by dir0; will be a much
> better option.
>
>
> —Andries
>
> > On Nov 5, 2015, at 8:16 AM, John Omernik <jo...@omernik.com> wrote:
> >
> > I am getting the same error as before.  I am using the same query to
> create
> > my tables, so I am unsure why it would see the schema as different. Is
> > there way to view the schema for each directory to see what Drill thinks
> is
> > happening?
> >
> > On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> >> Have you tried to create a view on top of parqtable, and then use the
> view?
> >>
> >> I did a quick experiment that may help you.
> >>
> >> Created a dir structure as follows (using csv files - what I had on
> hand)
> >> orders
> >> |____month1
> >> |____month2
> >> .
> >> .
> >> |____month9
> >>
> >> create or replace view dfs.views.ordersdir as select dir0, * from
> >> dfs.data.`/orders`;
> >>
> >> 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
> >> +---------+-------------------------------------------------------------------+---------+
> >> |  dir0   |                              columns                              |  dir00  |
> >> +---------+-------------------------------------------------------------------+---------+
> >> | month1  | ["8000","January","2014-01-02 23:06:07","15349","ri","203","26"]  | month1  |
> >> | month1  | ["8001","January","2014-01-31 03:40:21","16033","oh","113","23"]  | month1  |
> >> +---------+-------------------------------------------------------------------+---------+
> >>
> >> (interesting to note dir00 in the output)
> >>
> >> But the group by on the dir works as it is a column defined in the view
> >>
> >> 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by dir0;
> >> +---------+---------+
> >> |  dir0   | EXPR$1  |
> >> +---------+---------+
> >> | month1  | 9000    |
> >> | month6  | 16000   |
> >> | month9  | 15131   |
> >> | month2  | 12000   |
> >> | month3  | 16000   |
> >> | month7  | 15573   |
> >> | month5  | 16000   |
> >> | month4  | 12000   |
> >> | month8  | 15390   |
> >> +---------+---------+
> >>
> >> See if that works in your case.
> >>
> >>
> >> —Andries
> >>
> >>
> >>> On Nov 5, 2015, at 6:48 AM, John Omernik <jo...@omernik.com> wrote:
> >>>
> >>> Hey all,  to facilitate loading of some data from JSON to Parquet, I am
> >>> using the the load into "day" based directories...
> >>>
> >>> parqtable
> >>> |
> >>> |_______2015-11-01
> >>> |
> >>> |_______2015-11-02
> >>> |
> >>> |_______2015-11-03
> >>> |
> >>> |_______2015-11-04
> >>>
> >>> That way I can do select * from `parqtable` where dir0 = '2015-11-01'
> and
> >>> other cool tricks. It also helps my data loading.
> >>>
> >>> I am using the exact same query to load each day.
> >>>
> >>> CREATE TABLE `parqtable/2015-11-01' as
> >>> (select field1, field2, field3, field4 from jsontable where dir0 =
> >>> '2015-11-01')
> >>>
> >>> CREATE TABLE `parqtable/2015-11-02' as
> >>> (select field1, field2, field3, field4 from jsontable where dir0 =
> >>> '2015-11-02')
> >>>
> >>> CREATE TABLE `parqtable/2015-11-03' as
> >>> (select field1, field2, field3, field4 from jsontable where dir0 =
> >>> '2015-11-03')
> >>>
> >>> Etc
> >>>
> >>> This seams to work well except for one thing:
> >>>
> >>> If I want to see the count per directory, this (what I thought was
> >> obvious)
> >>> query:
> >>>
> >>> select dir0, count(*) from `parqtable` group by dir0
> >>>
> >>> fails with
> >>>
> >>> Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
> >> schema
> >>> changes
> >>>
> >>> Fragment: 2:8
> >>>
> >>>
> >>> I am not sure why this would be the case, the data is loaded by the
> same
> >>> query, I would assume the schema is the same....
> >>>
> >>> Thoughts on how to troubleshoot?
> >>>
> >>> Thanks!
> >>>
> >>> John
> >>
> >>
>
>

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Posted by Andries Engelbrecht <ae...@maprtech.com>.
You can query each directory individually to see which one is troublesome.
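
For example, something along these lines per day directory should show
where it starts failing, and the limit 0 select at least shows the column
set Drill comes up with for that directory (the typeof() call assumes your
build has that function):

select count(1) from `parqtable/2015-11-01`;
select * from `parqtable/2015-11-01` limit 0;
select typeof(field1), typeof(field2), typeof(field3), typeof(field4)
from `parqtable/2015-11-01` limit 1;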

Also, until the changes Jason is discussing are made to handle schema change better, I would suggest that you avoid * operations as a general rule of thumb, unless needed.

Since Drill has to manage schema on the fly, * operations are a lot more involved than in a traditional RDBMS where the schema is known and the optimizer can easily manage it. A simple select count(*) on a traditional RDBMS with stats is a very simple operation, whereas in Drill it can be a challenge as the schema is not necessarily known ahead of time.

For this reason I normally suggest that people use count() only on a single field that has a simple data type and is known not to change. Typically a key or id field is best to use; this way you sidestep schema challenges and still get the result you want.

select dir0, count(id_field) from parqtable group by dir0; will be a much better option.


—Andries

> On Nov 5, 2015, at 8:16 AM, John Omernik <jo...@omernik.com> wrote:
> 
> I am getting the same error as before.  I am using the same query to create
> my tables, so I am unsure why it would see the schema as different. Is
> there way to view the schema for each directory to see what Drill thinks is
> happening?
> 
> On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> Have you tried to create a view on top of parqtable, and then use the view?
>> 
>> I did a quick experiment that may help you.
>> 
>> Created a dir structure as follows (using csv files - what I had on hand)
>> orders
>> |____month1
>> |____month2
>> .
>> .
>> |____month9
>> 
>> create or replace view dfs.views.ordersdir as select dir0, * from
>> dfs.data.`/orders`;
>> 
>> 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
>> +---------+-------------------------------------------------------------------+---------+
>> |  dir0   |                              columns                              |  dir00  |
>> +---------+-------------------------------------------------------------------+---------+
>> | month1  | ["8000","January","2014-01-02 23:06:07","15349","ri","203","26"]  | month1  |
>> | month1  | ["8001","January","2014-01-31 03:40:21","16033","oh","113","23"]  | month1  |
>> +---------+-------------------------------------------------------------------+---------+
>>
>> (interesting to note dir00 in the output)
>>
>> But the group by on the dir works as it is a column defined in the view
>>
>> 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by dir0;
>> +---------+---------+
>> |  dir0   | EXPR$1  |
>> +---------+---------+
>> | month1  | 9000    |
>> | month6  | 16000   |
>> | month9  | 15131   |
>> | month2  | 12000   |
>> | month3  | 16000   |
>> | month7  | 15573   |
>> | month5  | 16000   |
>> | month4  | 12000   |
>> | month8  | 15390   |
>> +---------+---------+
>> 
>> See if that works in your case.
>> 
>> 
>> —Andries
>> 
>> 
>>> On Nov 5, 2015, at 6:48 AM, John Omernik <jo...@omernik.com> wrote:
>>> 
>>> Hey all,  to facilitate loading of some data from JSON to Parquet, I am
>>> using the the load into "day" based directories...
>>> 
>>> parqtable
>>> |
>>> |_______2015-11-01
>>> |
>>> |_______2015-11-02
>>> |
>>> |_______2015-11-03
>>> |
>>> |_______2015-11-04
>>> 
>>> That way I can do select * from `parqtable` where dir0 = '2015-11-01' and
>>> other cool tricks. It also helps my data loading.
>>> 
>>> I am using the exact same query to load each day.
>>> 
>>> CREATE TABLE `parqtable/2015-11-01' as
>>> (select field1, field2, field3, field4 from jsontable where dir0 =
>>> '2015-11-01')
>>> 
>>> CREATE TABLE `parqtable/2015-11-02' as
>>> (select field1, field2, field3, field4 from jsontable where dir0 =
>>> '2015-11-02')
>>> 
>>> CREATE TABLE `parqtable/2015-11-03' as
>>> (select field1, field2, field3, field4 from jsontable where dir0 =
>>> '2015-11-03')
>>> 
>>> Etc
>>> 
>>> This seams to work well except for one thing:
>>> 
>>> If I want to see the count per directory, this (what I thought was
>> obvious)
>>> query:
>>> 
>>> select dir0, count(*) from `parqtable` group by dir0
>>> 
>>> fails with
>>> 
>>> Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
>> schema
>>> changes
>>> 
>>> Fragment: 2:8
>>> 
>>> 
>>> I am not sure why this would be the case, the data is loaded by the same
>>> query, I would assume the schema is the same....
>>> 
>>> Thoughts on how to troubleshoot?
>>> 
>>> Thanks!
>>> 
>>> John
>> 
>> 


Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Posted by John Omernik <jo...@omernik.com>.
I am getting the same error as before. I am using the same query to create
my tables, so I am unsure why it would see the schema as different. Is
there a way to view the schema for each directory to see what Drill thinks
is happening?

On Thu, Nov 5, 2015 at 9:35 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> Have you tried to create a view on top of parqtable, and then use the view?
>
> I did a quick experiment that may help you.
>
> Created a dir structure as follows (using csv files - what I had on hand)
> orders
> |____month1
> |____month2
> .
> .
> |____month9
>
> create or replace view dfs.views.ordersdir as select dir0, * from
> dfs.data.`/orders`;
>
> 0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
> +---------+-------------------------------------------------------------------+---------+
> |  dir0   |                              columns                              |  dir00  |
> +---------+-------------------------------------------------------------------+---------+
> | month1  | ["8000","January","2014-01-02 23:06:07","15349","ri","203","26"]  | month1  |
> | month1  | ["8001","January","2014-01-31 03:40:21","16033","oh","113","23"]  | month1  |
> +---------+-------------------------------------------------------------------+---------+
>
> (interesting to note dir00 in the output)
>
> But the group by on the dir works as it is a column defined in the view
>
> 0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by dir0;
> +---------+---------+
> |  dir0   | EXPR$1  |
> +---------+---------+
> | month1  | 9000    |
> | month6  | 16000   |
> | month9  | 15131   |
> | month2  | 12000   |
> | month3  | 16000   |
> | month7  | 15573   |
> | month5  | 16000   |
> | month4  | 12000   |
> | month8  | 15390   |
> +---------+---------+
>
> See if that works in your case.
>
>
> —Andries
>
>
> > On Nov 5, 2015, at 6:48 AM, John Omernik <jo...@omernik.com> wrote:
> >
> > Hey all,  to facilitate loading of some data from JSON to Parquet, I am
> > using the the load into "day" based directories...
> >
> > parqtable
> > |
> > |_______2015-11-01
> > |
> > |_______2015-11-02
> > |
> > |_______2015-11-03
> > |
> > |_______2015-11-04
> >
> > That way I can do select * from `parqtable` where dir0 = '2015-11-01' and
> > other cool tricks. It also helps my data loading.
> >
> > I am using the exact same query to load each day.
> >
> > CREATE TABLE `parqtable/2015-11-01' as
> > (select field1, field2, field3, field4 from jsontable where dir0 =
> > '2015-11-01')
> >
> > CREATE TABLE `parqtable/2015-11-02' as
> > (select field1, field2, field3, field4 from jsontable where dir0 =
> > '2015-11-02')
> >
> > CREATE TABLE `parqtable/2015-11-03' as
> > (select field1, field2, field3, field4 from jsontable where dir0 =
> > '2015-11-03')
> >
> > Etc
> >
> > This seams to work well except for one thing:
> >
> > If I want to see the count per directory, this (what I thought was
> obvious)
> > query:
> >
> > select dir0, count(*) from `parqtable` group by dir0
> >
> > fails with
> >
> > Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support
> schema
> > changes
> >
> > Fragment: 2:8
> >
> >
> > I am not sure why this would be the case, the data is loaded by the same
> > query, I would assume the schema is the same....
> >
> > Thoughts on how to troubleshoot?
> >
> > Thanks!
> >
> > John
>
>

Re: Parquet Files Loaded to "directory" partitions Error on Grouping by dir0

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Have you tried to create a view on top of parqtable, and then use the view?

I did a quick experiment that may help you.

Created a dir structure as follows (using csv files - what I had on hand)
orders
|____month1
|____month2
.
.
|____month9

create or replace view dfs.views.ordersdir as select dir0, * from dfs.data.`/orders`;

0: jdbc:drill:> select * from dfs.views.ordersdir limit 2;
+---------+-------------------------------------------------------------------+---------+
|  dir0   |                              columns                              |  dir00  |
+---------+-------------------------------------------------------------------+---------+
| month1  | ["8000","January","2014-01-02 23:06:07","15349","ri","203","26"]  | month1  |
| month1  | ["8001","January","2014-01-31 03:40:21","16033","oh","113","23"]  | month1  |
+---------+-------------------------------------------------------------------+---------+

(interesting to note dir00 in the output)

But the group by on the dir works as it is a column defined in the view

0: jdbc:drill:> select dir0, count(*) from dfs.views.ordersdir group by dir0;
+---------+---------+
|  dir0   | EXPR$1  |
+---------+---------+
| month1  | 9000    |
| month6  | 16000   |
| month9  | 15131   |
| month2  | 12000   |
| month3  | 16000   |
| month7  | 15573   |
| month5  | 16000   |
| month4  | 12000   |
| month8  | 15390   |
+---------+---------+

See if that works in your case.
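
Applied to your table it would look something like this (dfs.views and
dfs.data are just the workspace names from my setup, and parqview is a
made-up view name; adjust them to your storage config):

create or replace view dfs.views.parqview as
select dir0, field1, field2, field3, field4 from dfs.data.`/parqtable`;

select dir0, count(*) from dfs.views.parqview group by dir0;

Listing the fields instead of using * in the view should also avoid the
duplicate dir00 column noted above.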


—Andries


> On Nov 5, 2015, at 6:48 AM, John Omernik <jo...@omernik.com> wrote:
> 
> Hey all,  to facilitate loading of some data from JSON to Parquet, I am
> using the the load into "day" based directories...
> 
> parqtable
> |
> |_______2015-11-01
> |
> |_______2015-11-02
> |
> |_______2015-11-03
> |
> |_______2015-11-04
> 
> That way I can do select * from `parqtable` where dir0 = '2015-11-01' and
> other cool tricks. It also helps my data loading.
> 
> I am using the exact same query to load each day.
> 
> CREATE TABLE `parqtable/2015-11-01' as
> (select field1, field2, field3, field4 from jsontable where dir0 =
> '2015-11-01')
> 
> CREATE TABLE `parqtable/2015-11-02' as
> (select field1, field2, field3, field4 from jsontable where dir0 =
> '2015-11-02')
> 
> CREATE TABLE `parqtable/2015-11-03' as
> (select field1, field2, field3, field4 from jsontable where dir0 =
> '2015-11-03')
> 
> Etc
> 
> This seams to work well except for one thing:
> 
> If I want to see the count per directory, this (what I thought was obvious)
> query:
> 
> select dir0, count(*) from `parqtable` group by dir0
> 
> fails with
> 
> Error: UNSUPPORTED_OPERATION_ ERROR: Hash aggregate does not support schema
> changes
> 
> Fragment: 2:8
> 
> 
> I am not sure why this would be the case, the data is loaded by the same
> query, I would assume the schema is the same....
> 
> Thoughts on how to troubleshoot?
> 
> Thanks!
> 
> John