You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Paul Rogers <pr...@mapr.com> on 2017/10/09 17:45:15 UTC

Implicit columns and maps

Hi All,

Drill provides a set of “implicit” columns to describe files: filename, suffix, fan and filepath. Drill also provides an open-ended set of partition columns: dir0, dir1, dir2, etc.

Not all readers support the above: some do and some don’t.

Drill semantics seem to treat these as semi-reserved words when a reader supports implicit columns. If a table has a “suffix” column, then Drill will treat “suffix” as an implicit column, ignoring the table column. If the user wants that table column, they can use a session option to temporarily rename the implicit column. A bit odd, perhaps, but it is our solution.

What is our desired behavior, however, if the user asks for a column that includes an implicit column as a prefix: “suffix.a”? Clearly, here, “suffix” is a map (i.e. structure) and “a” is a field within that map. Since the implicit “suffix” is never a map, should we:

1) Assume that, here, “suffix” is a map column projected from the table?
2) Issue an error?
3) Ignore the “.a” part and just return “suffix” as an implicit column?
4) Something else?

The code is murky on this point because JSON is implemented far differently than text files and so on. Each has its own rules. Do we need consistency of behavior, or is reader-specific behavior the expected design?

Thanks,

- Paul


Re: Implicit columns and maps

Posted by Paul Rogers <pr...@mapr.com>.
Hi Jinfeng,

Thanks much for your thoughts. To follow up on two topics.

First, how would you suggest we handle the following:

SELECT user, filename, `filename` FROM …

Here, the project list handed to the scan operator will have three columns, two with name “filename”. I suppose we’d need to modify SchemaPath to indicate if a column is intended to be a metadata (implicit) column or a regular (table) column. In that way, the name (TABLE, “filename”) would not collide with (METADATA, “filename”).

Second, on the directory columns, we wouldn’t want to break existing queries. So, the thought is that the user can decide whether to use `dir` (all partitions in an array) or `dir0`, `dir1` and so on.

The issue with the specific `dirn` columns is the problem that occurs if scan A sees “/foo/first.csv” while Scan B sees “/foo/bar/second.csv.” The first scan creates only dir0, the second creates dir0 and dir1. The result is a schema change in, say, the sort. (The point here is two distinct scans; we have a solution if these two files are two readers within a single scan.)

By using an array, then the schema is identical between the two scans, only the data differs.

I suspect that the “dir0”, “dir1” columns originated in a non-distributed system where the above ambiguity does not occur.

Or, is there a better way to solve the problem? Special case the “dir” columns in each operator that might otherwise have to deal with a schema change?

Thanks,

- Paul

> On Oct 27, 2017, at 10:56 PM, Jinfeng Ni <jn...@apache.org> wrote:
> 
> I think it would make sense to treat the implicit columns as reserved
> words, and if user wants to use those names as regular column, or table
> name, or any objects in the system, they should use quoted identifier.
> 
> select suffix from t1; // implicit columns takes precedence
> 
> select `suffix` from t1; // regular columns takes precedence.
> 
> That's what database would typically handles the conflicts between
> keywords/reserved words vs identifier used for regular objects.
> 
> Regarding the change from dir0, dir1 to dir[0], dir[1], personally I do not
> think it's a good idea, considering the impact on the application side (it
> will break any application that built on Drill using those columns).
> However, I agree that we should not add implicit columns in the record
> reader, then remove them later, as reported in DRILL-5542 [1].
> 
> 1. https://issues.apache.org/jira/browse/DRILL-5542
> 
> 
> On Mon, Oct 9, 2017 at 2:38 PM, Paul Rogers <pr...@mapr.com> wrote:
> 
>> As you point out, naming is a separate issue. I believe we inherited names
>> from some other system. But, it is an issue that we use “good” names for
>> implicit columns. If we add more names “createDate”, “modifcationDate”,
>> “owner”, or whatever), we end up breaking someone’s queries that has
>> columns with those names.
>> 
>> Would be good to have a prefix, as you suggest ,to separate Drill names
>> from user names. As it turns out, Drill supports maps (AKA structs) and
>> arrays, so perhaps we could have:
>> 
>> $meta$
>> |- filename
>> |- fqn
>> |- …
>> |- dir[]
>> 
>> Where “dir” is an array, rather than the current separate scalar dir0,
>> dir1, etc.
>> 
>> The above map let’s us add any number of metadata columns without
>> potentially breaking existing queries.
>> 
>> Note that we already have a problem where we can hit a hard schema change
>> because one reader sees “/a/b/c/foo.csv” while another sees “a/b/bar.csv”,
>> resulting in different numbers of “dirx” columns from the two readers.
>> 
>> Yet another issue is that, in wildcard queries (e.g. “SELECT *”), we add
>> all implicit columns, then remove them later. We should optimize this case.
>> 
>> But, even if we keep the original names, and defer the other issues, the
>> question about map semantics still stands…
>> 
>> - Paul
>> 
>>> On Oct 9, 2017, at 12:06 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
>>> 
>>>   How about changing all those “implicit” columns to have some
>> “unconventional” prefix, like an underscore (or two _ _ ); e.g. _suffix,
>> _dir0, etc .
>>> 
>>> With such a change we may need to handle the transition of existing
>> users’ code ; e.g., maybe change the priority (mentioned below) so that an
>> existing “suffix” column takes precedence over the implicit one.
>>> Or just go “cold turkey” and force the users to change.
>>> 
>>>    Just an idea,
>>> 
>>>           Boaz
>>> 
>>> On 10/9/17, 10:45 AM, "Paul Rogers" <pr...@mapr.com> wrote:
>>> 
>>>   Hi All,
>>> 
>>>   Drill provides a set of “implicit” columns to describe files:
>> filename, suffix, fan and filepath. Drill also provides an open-ended set
>> of partition columns: dir0, dir1, dir2, etc.
>>> 
>>>   Not all readers support the above: some do and some don’t.
>>> 
>>>   Drill semantics seem to treat these as semi-reserved words when a
>> reader supports implicit columns. If a table has a “suffix” column, then
>> Drill will treat “suffix” as an implicit column, ignoring the table column.
>> If the user wants that table column, they can use a session option to
>> temporarily rename the implicit column. A bit odd, perhaps, but it is our
>> solution.
>>> 
>>>   What is our desired behavior, however, if the user asks for a column
>> that includes an implicit column as a prefix: “suffix.a”? Clearly, here,
>> “suffix” is a map (i.e. structure) and “a” is a field within that map.
>> Since the implicit “suffix” is never a map, should we:
>>> 
>>>   1) Assume that, here, “suffix” is a map column projected from the
>> table?
>>>   2) Issue an error?
>>>   3) Ignore the “.a” part and just return “suffix” as an implicit
>> column?
>>>   4) Something else?
>>> 
>>>   The code is murky on this point because JSON is implemented far
>> differently than text files and so on. Each has its own rules. Do we need
>> consistency of behavior, or is reader-specific behavior the expected design?
>>> 
>>>   Thanks,
>>> 
>>>   - Paul
>>> 
>>> 
>>> 
>> 
>> 


Re: Implicit columns and maps

Posted by Jinfeng Ni <jn...@apache.org>.
I think it would make sense to treat the implicit columns as reserved
words, and if user wants to use those names as regular column, or table
name, or any objects in the system, they should use quoted identifier.

select suffix from t1; // implicit columns takes precedence

select `suffix` from t1; // regular columns takes precedence.

That's what database would typically handles the conflicts between
keywords/reserved words vs identifier used for regular objects.

Regarding the change from dir0, dir1 to dir[0], dir[1], personally I do not
think it's a good idea, considering the impact on the application side (it
will break any application that built on Drill using those columns).
However, I agree that we should not add implicit columns in the record
reader, then remove them later, as reported in DRILL-5542 [1].

1. https://issues.apache.org/jira/browse/DRILL-5542


On Mon, Oct 9, 2017 at 2:38 PM, Paul Rogers <pr...@mapr.com> wrote:

> As you point out, naming is a separate issue. I believe we inherited names
> from some other system. But, it is an issue that we use “good” names for
> implicit columns. If we add more names “createDate”, “modifcationDate”,
> “owner”, or whatever), we end up breaking someone’s queries that has
> columns with those names.
>
> Would be good to have a prefix, as you suggest ,to separate Drill names
> from user names. As it turns out, Drill supports maps (AKA structs) and
> arrays, so perhaps we could have:
>
> $meta$
> |- filename
> |- fqn
> |- …
> |- dir[]
>
> Where “dir” is an array, rather than the current separate scalar dir0,
> dir1, etc.
>
> The above map let’s us add any number of metadata columns without
> potentially breaking existing queries.
>
> Note that we already have a problem where we can hit a hard schema change
> because one reader sees “/a/b/c/foo.csv” while another sees “a/b/bar.csv”,
> resulting in different numbers of “dirx” columns from the two readers.
>
> Yet another issue is that, in wildcard queries (e.g. “SELECT *”), we add
> all implicit columns, then remove them later. We should optimize this case.
>
> But, even if we keep the original names, and defer the other issues, the
> question about map semantics still stands…
>
> - Paul
>
> > On Oct 9, 2017, at 12:06 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
> >
> >    How about changing all those “implicit” columns to have some
> “unconventional” prefix, like an underscore (or two _ _ ); e.g. _suffix,
> _dir0, etc .
> >
> > With such a change we may need to handle the transition of existing
> users’ code ; e.g., maybe change the priority (mentioned below) so that an
> existing “suffix” column takes precedence over the implicit one.
> > Or just go “cold turkey” and force the users to change.
> >
> >     Just an idea,
> >
> >            Boaz
> >
> > On 10/9/17, 10:45 AM, "Paul Rogers" <pr...@mapr.com> wrote:
> >
> >    Hi All,
> >
> >    Drill provides a set of “implicit” columns to describe files:
> filename, suffix, fan and filepath. Drill also provides an open-ended set
> of partition columns: dir0, dir1, dir2, etc.
> >
> >    Not all readers support the above: some do and some don’t.
> >
> >    Drill semantics seem to treat these as semi-reserved words when a
> reader supports implicit columns. If a table has a “suffix” column, then
> Drill will treat “suffix” as an implicit column, ignoring the table column.
> If the user wants that table column, they can use a session option to
> temporarily rename the implicit column. A bit odd, perhaps, but it is our
> solution.
> >
> >    What is our desired behavior, however, if the user asks for a column
> that includes an implicit column as a prefix: “suffix.a”? Clearly, here,
> “suffix” is a map (i.e. structure) and “a” is a field within that map.
> Since the implicit “suffix” is never a map, should we:
> >
> >    1) Assume that, here, “suffix” is a map column projected from the
> table?
> >    2) Issue an error?
> >    3) Ignore the “.a” part and just return “suffix” as an implicit
> column?
> >    4) Something else?
> >
> >    The code is murky on this point because JSON is implemented far
> differently than text files and so on. Each has its own rules. Do we need
> consistency of behavior, or is reader-specific behavior the expected design?
> >
> >    Thanks,
> >
> >    - Paul
> >
> >
> >
>
>

Re: Implicit columns and maps

Posted by Paul Rogers <pr...@mapr.com>.
As you point out, naming is a separate issue. I believe we inherited names from some other system. But, it is an issue that we use “good” names for implicit columns. If we add more names “createDate”, “modifcationDate”, “owner”, or whatever), we end up breaking someone’s queries that has columns with those names.

Would be good to have a prefix, as you suggest ,to separate Drill names from user names. As it turns out, Drill supports maps (AKA structs) and arrays, so perhaps we could have:

$meta$
|- filename
|- fqn
|- …
|- dir[]

Where “dir” is an array, rather than the current separate scalar dir0, dir1, etc.

The above map let’s us add any number of metadata columns without potentially breaking existing queries.

Note that we already have a problem where we can hit a hard schema change because one reader sees “/a/b/c/foo.csv” while another sees “a/b/bar.csv”, resulting in different numbers of “dirx” columns from the two readers.

Yet another issue is that, in wildcard queries (e.g. “SELECT *”), we add all implicit columns, then remove them later. We should optimize this case.

But, even if we keep the original names, and defer the other issues, the question about map semantics still stands…

- Paul

> On Oct 9, 2017, at 12:06 PM, Boaz Ben-Zvi <bb...@mapr.com> wrote:
> 
>    How about changing all those “implicit” columns to have some “unconventional” prefix, like an underscore (or two _ _ ); e.g. _suffix, _dir0, etc .
> 
> With such a change we may need to handle the transition of existing users’ code ; e.g., maybe change the priority (mentioned below) so that an existing “suffix” column takes precedence over the implicit one.
> Or just go “cold turkey” and force the users to change.
> 
>     Just an idea,
> 
>            Boaz  
> 
> On 10/9/17, 10:45 AM, "Paul Rogers" <pr...@mapr.com> wrote:
> 
>    Hi All,
> 
>    Drill provides a set of “implicit” columns to describe files: filename, suffix, fan and filepath. Drill also provides an open-ended set of partition columns: dir0, dir1, dir2, etc.
> 
>    Not all readers support the above: some do and some don’t.
> 
>    Drill semantics seem to treat these as semi-reserved words when a reader supports implicit columns. If a table has a “suffix” column, then Drill will treat “suffix” as an implicit column, ignoring the table column. If the user wants that table column, they can use a session option to temporarily rename the implicit column. A bit odd, perhaps, but it is our solution.
> 
>    What is our desired behavior, however, if the user asks for a column that includes an implicit column as a prefix: “suffix.a”? Clearly, here, “suffix” is a map (i.e. structure) and “a” is a field within that map. Since the implicit “suffix” is never a map, should we:
> 
>    1) Assume that, here, “suffix” is a map column projected from the table?
>    2) Issue an error?
>    3) Ignore the “.a” part and just return “suffix” as an implicit column?
>    4) Something else?
> 
>    The code is murky on this point because JSON is implemented far differently than text files and so on. Each has its own rules. Do we need consistency of behavior, or is reader-specific behavior the expected design?
> 
>    Thanks,
> 
>    - Paul
> 
> 
> 


Re: Implicit columns and maps

Posted by Boaz Ben-Zvi <bb...@mapr.com>.
    How about changing all those “implicit” columns to have some “unconventional” prefix, like an underscore (or two _ _ ); e.g. _suffix, _dir0, etc .

With such a change we may need to handle the transition of existing users’ code ; e.g., maybe change the priority (mentioned below) so that an existing “suffix” column takes precedence over the implicit one.
Or just go “cold turkey” and force the users to change.

     Just an idea,

            Boaz  

On 10/9/17, 10:45 AM, "Paul Rogers" <pr...@mapr.com> wrote:

    Hi All,
    
    Drill provides a set of “implicit” columns to describe files: filename, suffix, fan and filepath. Drill also provides an open-ended set of partition columns: dir0, dir1, dir2, etc.
    
    Not all readers support the above: some do and some don’t.
    
    Drill semantics seem to treat these as semi-reserved words when a reader supports implicit columns. If a table has a “suffix” column, then Drill will treat “suffix” as an implicit column, ignoring the table column. If the user wants that table column, they can use a session option to temporarily rename the implicit column. A bit odd, perhaps, but it is our solution.
    
    What is our desired behavior, however, if the user asks for a column that includes an implicit column as a prefix: “suffix.a”? Clearly, here, “suffix” is a map (i.e. structure) and “a” is a field within that map. Since the implicit “suffix” is never a map, should we:
    
    1) Assume that, here, “suffix” is a map column projected from the table?
    2) Issue an error?
    3) Ignore the “.a” part and just return “suffix” as an implicit column?
    4) Something else?
    
    The code is murky on this point because JSON is implemented far differently than text files and so on. Each has its own rules. Do we need consistency of behavior, or is reader-specific behavior the expected design?
    
    Thanks,
    
    - Paul