Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/09/17 19:42:09 UTC

directory pruning and UDFs

Hi,

I have been writing a few simple utility functions for Drill and staring at
the cumbersome dirN conditions required to take advantage of directory
pruning.

Would it be possible to allow UDFs to throw fileOutOfScope and
directoryOutOfScope exceptions? That would a) let me write a fairly
clever inRange(from, to, dirN...) function and b) allow for additional
pruning during execution.

Maybe I'm seeing this all wrong, but complicating every query with a
sometimes quite involved dirN tail just seems like too much redundancy.
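
For what it's worth, the range test such a UDF could encapsulate might look
roughly like this (plain Python for illustration; inRange and its pruning
semantics are only a proposal here, not an existing Drill feature):

```python
import calendar
from datetime import date

def in_range(start, end, *dirs):
    """Sketch of the check a hypothetical inRange(from, to, dirN...) UDF
    might perform: treat the dirN values (e.g. '2015', '09', '17') as a
    date prefix and test whether the span that prefix covers overlaps
    [start, end]. A false result would mean the directory can be pruned."""
    parts = [int(d) for d in dirs]
    year = parts[0]
    if len(parts) == 1:                       # year directory: whole year
        lo, hi = date(year, 1, 1), date(year, 12, 31)
    elif len(parts) == 2:                     # year/month directory
        month = parts[1]
        lo = date(year, month, 1)
        hi = date(year, month, calendar.monthrange(year, month)[1])
    else:                                     # year/month/day directory
        lo = hi = date(year, parts[1], parts[2])
    return lo <= end and hi >= start
```

A directory whose prefix fails this test could then be skipped entirely
rather than scanned and filtered row by row.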

Regards,
 -Stefan

Re: directory pruning and UDFs

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

It's here: https://issues.apache.org/jira/browse/DRILL-3838

Hopefully this can be accommodated soon :).

Regards,
 -Stefan




Re: directory pruning and UDFs

Posted by Jacques Nadeau <ja...@dremio.com>.
Hey Stefan,

Yes, this makes a lot of sense and seems reasonable. We've talked about
providing the simple filename as a virtual attribute. It seems like we
should also provide a full path attribute (from the root of the workspace).
Can you open a JIRA for this? It isn't something that is supported now but
should be fairly trivial to do while we are adding the filename virtual
attribute.

--
Jacques Nadeau
CTO and Co-Founder, Dremio
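
As a rough illustration of what a full-path attribute buys (plain Python
over made-up file names; in Drill this would be a predicate on the proposed
virtual column rather than Python):

```python
# Hypothetical file listing under a workspace root; names are illustrative.
files = [
    "/acme/web/streaming/2015/09/10/events.parquet",
    "/acme/web/streaming/2015/09/11/events.parquet",
    "/acme/web/streaming/2014/12/31/events.parquet",
]

# With the entire path exposed as a single value, one prefix (or range)
# test replaces a chain of separate dir0/dir1/dir2 comparisons.
selected = [f for f in files if f.startswith("/acme/web/streaming/2015/09/")]
```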


Re: directory pruning and UDFs

Posted by Stefán Baxter <st...@activitystream.com>.
Jacques,

Is this something you think makes sense and could be accommodated?

Regards,
 -Stefan


Re: directory pruning and UDFs

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

The short back story is this:

   - We are serving multiple tenants with vastly different data volumes and
   needs
   - there is no such thing as a fixed-period segment size (to get to an
   approximately even volume per segment)

   - We run queries that combine information from historical and fresh
   (streaming) data (parquet and json/avro respectively) using joins
   - currently we are using loggers to emit the streaming data, but this
   will be replaced

   - The "fresh" data (json/avro) files live in a single directory
   - 1 file per day

   - Fresh data is occasionally transformed from json/avro to parquet
   - the frequency of this is set on a per-tenant/volume basis

This is why we need/would like to*:

   - Use directory structure and file names as flexible chronological
   partitions (via UDFs)
   - Use parquet partitions for "logical data separation" based on
   attributes other than time

   * Please remember that being able to append new data to parquet files
   would eliminate the need for much of this
   ** The same is true if we moved this whole thing to some metadata-driven
   environment like Hive

The historical (parquet) directory structure might look something like this:

   1. /<tenant>/<source>/streaming/2015/09/10
   - high volume :: data transformed daily

   2. /<tenant>/<source>/streaming/2015/W10
   - medium volume :: data transformed weekly

   3. /<tenant>/<source>/streaming/2015/09
   - low(er) volume :: data transformed monthly
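
Translating these three layouts into date ranges is mechanical, which is part
of why expressing it once in a function is attractive. A minimal sketch of
that mapping (plain Python; dir_span is a hypothetical helper, and the
/acme/web paths are made-up stand-ins for the <tenant>/<source> placeholders):

```python
import calendar
import re
from datetime import date, timedelta

def dir_span(path):
    """Map a partition directory such as '/acme/web/streaming/2015/09/10'
    (daily), '/acme/web/streaming/2015/W10' (ISO week; uses the Python 3.8+
    date.fromisocalendar) or '/acme/web/streaming/2015/09' (monthly) to the
    (first_day, last_day) range it covers."""
    tail = path.strip("/").split("/")[3:]   # skip <tenant>/<source>/streaming
    year = int(tail[0])
    if len(tail) == 3:                      # daily leaf: YYYY/MM/DD
        d = date(year, int(tail[1]), int(tail[2]))
        return d, d
    week = re.fullmatch(r"W(\d+)", tail[1])
    if week:                                # weekly leaf: YYYY/Wnn
        start = date.fromisocalendar(year, int(week.group(1)), 1)
        return start, start + timedelta(days=6)
    month = int(tail[1])                    # monthly leaf: YYYY/MM
    return (date(year, month, 1),
            date(year, month, calendar.monthrange(year, month)[1]))
```

With such a mapping in hand, one range predicate can decide, for any of the
three layouts, whether a directory overlaps the queried period.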

So yes, we think that the ability to evaluate full paths and file names,
where we can affect the pruning/scanning with appropriate exceptions,
would help us regain some sanity :).

I realize that pruning should preferably be done in the planning phase, but
this would allow for a not-too-messy interception of the scanning process.

Best regards,
 -Stefan



Re: directory pruning and UDFs

Posted by Jacques Nadeau <ja...@dremio.com>.
Can you also provide some examples of what you are trying to accomplish?

It seems like you might be saying that you want a virtual attribute for the
entire path rather than individual pieces? Also remember that partition
pruning can also be done if you're using Parquet files without all the dirN
syntax.

--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: directory pruning and UDFs

Posted by Ted Dunning <te...@gmail.com>.
Stefan,

What you say sounds intriguing.  Can you show an example of how it would
look?


