You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2017/04/03 18:08:01 UTC

Drill Parquet Partitioning Method

So as a user of Drill now for a while, I have gotten used to the idea of
partitions just being values, instead of key=value like other things (hive,
impala, others).

From a user/analyst perspective, the dir0, dir1, dirN methodology provides
quite a bit of flexibility, but to be intuitive, we have to know what that
field is... thus there has to be some transfer of knowledge on what that
value for directory names are.

With other methods, the key is right there in the directory name.

Now, I am really getting to the nitty gritty here, I know we could do
things like create a view to name the dir0 to be something.

For example:

Drill Method:
mytable
- 2017-04-01
- 2017-04-02
- 2017-04-03

Vs.

Hive method:
mytable
- day=2017-04-01
- day=2017-04-02
- day=2017-04-03


However, it takes extra admin effort, and hive, spark, etc all know the
key=value method.

Drill stands on its own here.  So, my thought is this, dir0 is nice, it
provides flexibility.  But why not have drill be able to infer key=value,
and when writing partitions (although I don't think Drill does this yet)
write using alias specified?

The more important part is the reading as the writing doesn't really work
yet. (We don't Insert into mytable Partition by day like we do in hive, if
we want to write a partition, we create table mytable/partition thus could
easily put the key value in there as needed)

So the reading.  A. This could not break anything existing.  Thus, dir0
must always work. B. Can we use a select option to enable/disable? (Would
we even need this?).

Basically, if there is a = in the partition name, split by =, make the
value to be the right side, alias be left side.

The hard parts:

The planner would have to be aware of this, so when a scan of the directory
occurs, the field name as an alias could be valid...

If I did "select * from mytable where day = '2017-04-01' but that field
didn't exist, it could error out, that said, we should know that the
directories have Key=Value format when we scan for files... it's not like
that is impossible (especially since we don't know what fields are in the
parquet files unless we do metadata).

This would also be something we should include in metadata... If we do
Key=value then boom, write to the metadata cache, and speed up planning!

So why do I think we need this?

It would sure make data created by other sources easier/quicker to read. We
wouldn't be string parsing directory names at query time, and it would just
be another avenue to make Drill a natural fit in the ecosystem...


I would be interested in community thoughts here, if there is interest I
will make a Jira


John

Re: Drill Parquet Partitioning Method

Posted by John Omernik <jo...@omernik.com>.
When I post questions like that I take a very user centric mindset. For me,
abstracting what needs to be done here to ensure help users seamlessly work
with other tools make Drill a friendly addition to any Data Science team.
If admins have to do more, or train users to handle how Drill does stuff
differently than say Hive or Impala, it's not greeted as warmly.   Perhaps
we could have a option in Drill to limit how many levels is recurses in the
subdirectories. Even Hive folks say anymore than two gets really hairy...

On Mon, Apr 3, 2017 at 5:23 PM, Jesse Yates <je...@gmail.com> wrote:

> I filed a JIRA for this a while back:
> https://issues.apache.org/jira/browse/DRILL-4615
>
> And even more recently finished a patch on our internal branch (pegged at
> 1.6 right now). Since 1.6, things have shifted a bit in the abstraction of
> managing directory partitions, so the patch wouldn't be directly
> applicable. Where I got caught up was in the directory depth issue - there
> is/was a bunch of expectation in the Easy reader, and fs partition rule,
> around directory names being tied to known depth in the directory tree. It
> makes it easy to do dir0, dir1, but hard for any other case :-/
>
> The workaround I used was having a now config for the workspace, a
> DirectoryStrategy, that uses at least a known directory name for each level
> of the hierarchy and a parsing strategy for the value. So you would have a
> directory list of, for your example, ['day'], or for a more complicated
> structure like:
>   /mytable
>      /day=1
>        /hour=2
>
> a directory list like ['day', 'hour']. Not the best, but OK if you don't
> have changing directory structures for a workspace.
>
> Happy to post the patch if someone is interested, but it would take a bit
> of work (for which I don't have time right now) to finagle it into master.
>
> --Jesse
>
> On Mon, Apr 3, 2017 at 1:56 PM Jinfeng Ni <jn...@apache.org> wrote:
>
> That's a good idea.
>
> Let me clarify one thing first. Drill has two kinds of partitions:
> auto partition, or directory-based partition.
>
> The first one is a result of using drill's CTAS partition by statement
> [1].  Both partition column name and column value are written and
> encoded in the output file. When querying such data, you can directly
> specify those partition columns in a WHERE/HAVING clause, and Drill
> will do partition pruning.
>
> The second one, directory-based partition, is to handle the cases
> where the input files are generated by other tools. It could be
> created by hive's or impala's partition by clause, or it could
> generated by some other ecosystem. For hive generated files, you are
> right the directory name has the pattern of "key=value", for files
> generated by other tools, we may see "value" only as the directory
> name. For the hive generated files, what we recommend today to user do
> is to created a view ( i.e. substring(dir0, 4) as dateCol), to extract
> the value from the directory name.
>
> Your suggestion makes sense to me. We probably may consider adding an
> option to the file system format plugin, to 'auto' parse the
> "key=value" pattern, and replace "dirN" with "key".   That option
> would work in a similar way as the options (skipFirstLine,
> extractHeader, etc) ) we added for text file [2]. I agree with you
> that it would make Drill more seamlessly work with other ecosystems.
>
> If we do that, I'm not sure if Drill could error out for "select *
> from mytable where day = '2017-04-01' ", if there is no "day" field in
> the directory names. The thing is day could come from either
> directory, or from data files. That probably need long discussion
> about how Drill should handle a possible non-exist column.
>
>
> 1. https://drill.apache.org/docs/partition-by-clause/
> 2. https://drill.apache.org/docs/text-files-csv-tsv-psv/
>
> On Mon, Apr 3, 2017 at 11:08 AM, John Omernik <jo...@omernik.com> wrote:
> > So as a user of Drill now for a while, I have gotten used to the idea of
> > partitions just being values, instead of key=value like other things
> (hive,
> > impala, others).
> >
> > From a user/analyst perspective, the dir0, dir1, dirN methodology
> provides
> > quite a bit of flexibility, but to be intuitive, we have to know what
> that
> > field is... thus there has to be some transfer of knowledge on what that
> > value for directory names are.
> >
> > With other methods, the key is right there in the directory name.
> >
> > Now, I am really getting to the nitty gritty here, I know we could do
> > things like create a view to name the dir0 to be something.
> >
> > For example:
> >
> > Drill Method:
> > mytable
> > - 2017-04-01
> > - 2017-04-02
> > - 2017-04-03
> >
> > Vs.
> >
> > Hive method:
> > mytable
> > - day=2017-04-01
> > - day=2017-04-02
> > - day=2017-04-03
> >
> >
> > However, it takes extra admin effort, and hive, spark, etc all know the
> > key=value method.
> >
> > Drill stands on its own here.  So, my thought is this, dir0 is nice, it
> > provides flexibility.  But why not have drill be able to infer key=value,
> > and when writing partitions (although I don't think Drill does this yet)
> > write using alias specified?
> >
> > The more important part is the reading as the writing doesn't really work
> > yet. (We don't Insert into mytable Partition by day like we do in hive,
> if
> > we want to write a partition, we create table mytable/partition thus
> could
> > easily put the key value in there as needed)
> >
> > So the reading.  A. This could not break anything existing.  Thus, dir0
> > must always work. B. Can we use a select option to enable/disable? (Would
> > we even need this?).
> >
> > Basically, if there is a = in the partition name, split by =, make the
> > value to be the right side, alias be left side.
> >
> > The hard parts:
> >
> > The planner would have to be aware of this, so when a scan of the
> directory
> > occurs, the field name as an alias could be valid...
> >
> > If I did "select * from mytable where day = '2017-04-01' but that field
> > didn't exist, it could error out, that said, we should know that the
> > directories have Key=Value format when we scan for files... it's not like
> > that is impossible (especially since we don't know what fields are in the
> > parquet files unless we do metadata).
> >
> > This would also be something we should include in metadata... If we do
> > Key=value then boom, write to the metadata cache, and speed up planning!
> >
> > So why do I think we need this?
> >
> > It would sure make data created by other sources easier/quicker to read.
> We
> > wouldn't be string parsing directory names at query time, and it would
> just
> > be another avenue to make Drill a natural fit in the ecosystem...
> >
> >
> > I would be interested in community thoughts here, if there is interest I
> > will make a Jira
> >
> >
> > John
>
> --
> Jesse Yates
> Founder/CEO Fineo.io
> Book a meeting: https://calendly.com/jyates
>

Re: Drill Parquet Partitioning Method

Posted by Jesse Yates <je...@gmail.com>.
I filed a JIRA for this a while back:
https://issues.apache.org/jira/browse/DRILL-4615

And even more recently finished a patch on our internal branch (pegged at
1.6 right now). Since 1.6, things have shifted a bit in the abstraction of
managing directory partitions, so the patch wouldn't be directly
applicable. Where I got caught up was in the directory depth issue - there
is/was a bunch of expectation in the Easy reader, and fs partition rule,
around directory names being tied to known depth in the directory tree. It
makes it easy to do dir0, dir1, but hard for any other case :-/

The workaround I used was having a now config for the workspace, a
DirectoryStrategy, that uses at least a known directory name for each level
of the hierarchy and a parsing strategy for the value. So you would have a
directory list of, for your example, ['day'], or for a more complicated
structure like:
  /mytable
     /day=1
       /hour=2

a directory list like ['day', 'hour']. Not the best, but OK if you don't
have changing directory structures for a workspace.

Happy to post the patch if someone is interested, but it would take a bit
of work (for which I don't have time right now) to finagle it into master.

--Jesse

On Mon, Apr 3, 2017 at 1:56 PM Jinfeng Ni <jn...@apache.org> wrote:

That's a good idea.

Let me clarify one thing first. Drill has two kinds of partitions:
auto partition, or directory-based partition.

The first one is a result of using drill's CTAS partition by statement
[1].  Both partition column name and column value are written and
encoded in the output file. When querying such data, you can directly
specify those partition columns in a WHERE/HAVING clause, and Drill
will do partition pruning.

The second one, directory-based partition, is to handle the cases
where the input files are generated by other tools. It could be
created by hive's or impala's partition by clause, or it could
generated by some other ecosystem. For hive generated files, you are
right the directory name has the pattern of "key=value", for files
generated by other tools, we may see "value" only as the directory
name. For the hive generated files, what we recommend today to user do
is to created a view ( i.e. substring(dir0, 4) as dateCol), to extract
the value from the directory name.

Your suggestion makes sense to me. We probably may consider adding an
option to the file system format plugin, to 'auto' parse the
"key=value" pattern, and replace "dirN" with "key".   That option
would work in a similar way as the options (skipFirstLine,
extractHeader, etc) ) we added for text file [2]. I agree with you
that it would make Drill more seamlessly work with other ecosystems.

If we do that, I'm not sure if Drill could error out for "select *
from mytable where day = '2017-04-01' ", if there is no "day" field in
the directory names. The thing is day could come from either
directory, or from data files. That probably need long discussion
about how Drill should handle a possible non-exist column.


1. https://drill.apache.org/docs/partition-by-clause/
2. https://drill.apache.org/docs/text-files-csv-tsv-psv/

On Mon, Apr 3, 2017 at 11:08 AM, John Omernik <jo...@omernik.com> wrote:
> So as a user of Drill now for a while, I have gotten used to the idea of
> partitions just being values, instead of key=value like other things
(hive,
> impala, others).
>
> From a user/analyst perspective, the dir0, dir1, dirN methodology provides
> quite a bit of flexibility, but to be intuitive, we have to know what that
> field is... thus there has to be some transfer of knowledge on what that
> value for directory names are.
>
> With other methods, the key is right there in the directory name.
>
> Now, I am really getting to the nitty gritty here, I know we could do
> things like create a view to name the dir0 to be something.
>
> For example:
>
> Drill Method:
> mytable
> - 2017-04-01
> - 2017-04-02
> - 2017-04-03
>
> Vs.
>
> Hive method:
> mytable
> - day=2017-04-01
> - day=2017-04-02
> - day=2017-04-03
>
>
> However, it takes extra admin effort, and hive, spark, etc all know the
> key=value method.
>
> Drill stands on its own here.  So, my thought is this, dir0 is nice, it
> provides flexibility.  But why not have drill be able to infer key=value,
> and when writing partitions (although I don't think Drill does this yet)
> write using alias specified?
>
> The more important part is the reading as the writing doesn't really work
> yet. (We don't Insert into mytable Partition by day like we do in hive, if
> we want to write a partition, we create table mytable/partition thus could
> easily put the key value in there as needed)
>
> So the reading.  A. This could not break anything existing.  Thus, dir0
> must always work. B. Can we use a select option to enable/disable? (Would
> we even need this?).
>
> Basically, if there is a = in the partition name, split by =, make the
> value to be the right side, alias be left side.
>
> The hard parts:
>
> The planner would have to be aware of this, so when a scan of the
directory
> occurs, the field name as an alias could be valid...
>
> If I did "select * from mytable where day = '2017-04-01' but that field
> didn't exist, it could error out, that said, we should know that the
> directories have Key=Value format when we scan for files... it's not like
> that is impossible (especially since we don't know what fields are in the
> parquet files unless we do metadata).
>
> This would also be something we should include in metadata... If we do
> Key=value then boom, write to the metadata cache, and speed up planning!
>
> So why do I think we need this?
>
> It would sure make data created by other sources easier/quicker to read.
We
> wouldn't be string parsing directory names at query time, and it would
just
> be another avenue to make Drill a natural fit in the ecosystem...
>
>
> I would be interested in community thoughts here, if there is interest I
> will make a Jira
>
>
> John

-- 
Jesse Yates
Founder/CEO Fineo.io
Book a meeting: https://calendly.com/jyates

Re: Drill Parquet Partitioning Method

Posted by Jinfeng Ni <jn...@apache.org>.
That's a good idea.

Let me clarify one thing first. Drill has two kinds of partitions:
auto partition, or directory-based partition.

The first one is a result of using drill's CTAS partition by statement
[1].  Both partition column name and column value are written and
encoded in the output file. When querying such data, you can directly
specify those partition columns in a WHERE/HAVING clause, and Drill
will do partition pruning.

The second one, directory-based partition, is to handle the cases
where the input files are generated by other tools. It could be
created by hive's or impala's partition by clause, or it could
generated by some other ecosystem. For hive generated files, you are
right the directory name has the pattern of "key=value", for files
generated by other tools, we may see "value" only as the directory
name. For the hive generated files, what we recommend today to user do
is to created a view ( i.e. substring(dir0, 4) as dateCol), to extract
the value from the directory name.

Your suggestion makes sense to me. We probably may consider adding an
option to the file system format plugin, to 'auto' parse the
"key=value" pattern, and replace "dirN" with "key".   That option
would work in a similar way as the options (skipFirstLine,
extractHeader, etc) ) we added for text file [2]. I agree with you
that it would make Drill more seamlessly work with other ecosystems.

If we do that, I'm not sure if Drill could error out for "select *
from mytable where day = '2017-04-01' ", if there is no "day" field in
the directory names. The thing is day could come from either
directory, or from data files. That probably need long discussion
about how Drill should handle a possible non-exist column.


1. https://drill.apache.org/docs/partition-by-clause/
2. https://drill.apache.org/docs/text-files-csv-tsv-psv/

On Mon, Apr 3, 2017 at 11:08 AM, John Omernik <jo...@omernik.com> wrote:
> So as a user of Drill now for a while, I have gotten used to the idea of
> partitions just being values, instead of key=value like other things (hive,
> impala, others).
>
> From a user/analyst perspective, the dir0, dir1, dirN methodology provides
> quite a bit of flexibility, but to be intuitive, we have to know what that
> field is... thus there has to be some transfer of knowledge on what that
> value for directory names are.
>
> With other methods, the key is right there in the directory name.
>
> Now, I am really getting to the nitty gritty here, I know we could do
> things like create a view to name the dir0 to be something.
>
> For example:
>
> Drill Method:
> mytable
> - 2017-04-01
> - 2017-04-02
> - 2017-04-03
>
> Vs.
>
> Hive method:
> mytable
> - day=2017-04-01
> - day=2017-04-02
> - day=2017-04-03
>
>
> However, it takes extra admin effort, and hive, spark, etc all know the
> key=value method.
>
> Drill stands on its own here.  So, my thought is this, dir0 is nice, it
> provides flexibility.  But why not have drill be able to infer key=value,
> and when writing partitions (although I don't think Drill does this yet)
> write using alias specified?
>
> The more important part is the reading as the writing doesn't really work
> yet. (We don't Insert into mytable Partition by day like we do in hive, if
> we want to write a partition, we create table mytable/partition thus could
> easily put the key value in there as needed)
>
> So the reading.  A. This could not break anything existing.  Thus, dir0
> must always work. B. Can we use a select option to enable/disable? (Would
> we even need this?).
>
> Basically, if there is a = in the partition name, split by =, make the
> value to be the right side, alias be left side.
>
> The hard parts:
>
> The planner would have to be aware of this, so when a scan of the directory
> occurs, the field name as an alias could be valid...
>
> If I did "select * from mytable where day = '2017-04-01' but that field
> didn't exist, it could error out, that said, we should know that the
> directories have Key=Value format when we scan for files... it's not like
> that is impossible (especially since we don't know what fields are in the
> parquet files unless we do metadata).
>
> This would also be something we should include in metadata... If we do
> Key=value then boom, write to the metadata cache, and speed up planning!
>
> So why do I think we need this?
>
> It would sure make data created by other sources easier/quicker to read. We
> wouldn't be string parsing directory names at query time, and it would just
> be another avenue to make Drill a natural fit in the ecosystem...
>
>
> I would be interested in community thoughts here, if there is interest I
> will make a Jira
>
>
> John