Posted to dev@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2017/06/03 13:19:45 UTC

A possible regression 1.9 / 1.10 when querying Parquet with complex types /nested structures (Map)

Hi,

I have a sample data set (a few million records) that is saved to Parquet
in 2 ways: one with a simple flat structure that stores dimensions and
metrics as primitive types (String, Double), and one that uses nested maps
(String,String and String,Double, respectively).

Querying the data set with the simple types only:

select roundTimeStamp(s.occurred_at,'PT1H') as `at`, sum(metrics_price) as
price, sum(metrics_kwh) as kwh from
dfs.asa.`/processed/etactica-dev-p1/entitysamples/metrics/D2017*` as s
group by roundTimeStamp(s.occurred_at,'PT1H')


takes: *28.442* sec. (dev. laptop x 1)


Same query against the nested structure:

select roundTimeStamp(s.occurred_at,'PT1H') as `at`, sum(s.metrics.price)
as price, sum(s.metrics.kwh) as kwh from
dfs.asa.`/processed/etactica-dev-p1/entitysamples/metrics/D2017*` as s
group by roundTimeStamp(s.occurred_at,'PT1H')

takes: *719.810* sec.

Even counting the number of records takes a very long time if there is a
nested structure involved. (select count(*) from)
It does not behave like this on our production servers (1.8), but I have not
run this particular test on them (their performance has never been an
issue).
I have these sample files available if anyone wishes to reproduce this
consistently.
Regards,
 -Stefán

Re: A possible regression 1.9 / 1.10 when querying Parquet with complex types /nested structures (Map)

Posted by Stefán Baxter <st...@activitystream.com>.
Ok, the data is a bit sensitive.

I'll submit this when I have created a meaningful test set that I can
distribute.

- Stefán

On Sun, Jun 4, 2017 at 6:54 AM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Jira is always the preferable approach. Thank You.

Re: A possible regression 1.9 / 1.10 when querying Parquet with complex types /nested structures (Map)

Posted by rahul challapalli <ch...@gmail.com>.
Jira is always the preferable approach. Thank You.

On Sat, Jun 3, 2017 at 1:38 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi Rahul,
>
> Sure, but can I perhaps get the files to you directly?
>
> Regards,
>  -Stefán

Re: A possible regression 1.9 / 1.10 when querying Parquet with complex types /nested structures (Map)

Posted by Stefán Baxter <st...@activitystream.com>.
Hi Rahul,

Sure, but can I perhaps get the files to you directly?

Regards,
 -Stefán

On Sat, Jun 3, 2017 at 8:13 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Can you please raise a jira and attach the required files? I can try to
> reproduce it.
>
> Rahul

Re: A possible regression 1.9 / 1.10 when querying Parquet with complex types /nested structures (Map)

Posted by rahul challapalli <ch...@gmail.com>.
Can you please raise a jira and attach the required files? I can try to
reproduce it.

Rahul
