You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/12/14 10:06:08 UTC

Avro - Schema is good - Schema validation is bad

Hi,

I'm getting the following error when querying Avro files:

Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57:
Column 'some_col' not found in any table

It's true that the field is in none of the tables I'm targeting, in that
particular query, but that does not mean that it is in none of the possible
files I could be querying.

We use Avro to get the benefits of the schema but I never expected Drill to
enforce it this way.

Why do unresolved  columns not return null?

This makes no sense to me as I think a fundamental trade of Drill, when
trying to eliminate ETL, is to return null for any missing fields.

Please advise.

Regards,
 -Stefán

Re: Avro - Schema is good - Schema validation is bad

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Dec 16, 2015 at 6:18 AM, Jacques Nadeau <ja...@dremio.com> wrote:

> With regards to Ted's concern: I agree that applying a filter shouldn't
> fail a query. That means we will either have to consider the complete union
> Schema before pruning files or consider all fields as either known or
> possible after pruning files.
>

I think that is exactly the choice.

And I think that Drill can also use "possible" as an approximation of the
truth and that truth can be refined as the result of querying.  For
instance, if querying over the entire set of files in a directory were to
cause Drill to emit a statistical summary of all of those files, then on
later queries, Drill might be able to quickly say that none of the files
have a particular column and thus fail the query.

Re: Avro - Schema is good - Schema validation is bad

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

I appreciate being able to turn it off, thank you.

IMO scanning all the files in a directory structure sounds harmless until
you start using directory pruning, on large structures, when this becomes a
overhead.

I really think this is an essential part of an overdue discussion on the
"ground rules of Drill" and what is needed in an environment that
supports evolving
schema and aims to eliminate ETL.

I would suggest that all formats behave in the same way and that the
leniency allowed in Parquet, for example, is also allowed in Avro.

In other words, and to be a bit more direct, this should not be a format
specific thing and strict schema validation should never be done "just
because we can with XX format".

Happy holidays,
 -Stefan

On Fri, Dec 25, 2015 at 2:57 PM, Kamesh <ka...@gmail.com> wrote:

> For doing schema validation for Avro files, we are thinking of doing the
> following steps.Please provide your suggestions, before we implement this.
>
>    - Schema validation feature will be configurable and user can provide
>    whether he/she wants this feature or not by setting a configuration
>    property like PigAvroStorage
>    <https://cwiki.apache.org/confluence/display/PIG/AvroStorage>.
>    - If the schema validation flag is set, then we can consider the union
>    schema of all the files in a directory recursively.
>
>
> On Fri, Dec 18, 2015 at 9:17 AM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > Hi Kamesh,
> >
> > This is, strictly speaking, not the same issue even though they have in
> > common the fact that the Avro schema validation aspect.
> >
> > Regards,
> >  -Stefán
> >
> > On Fri, Dec 18, 2015 at 2:17 AM, Kamesh <ka...@gmail.com> wrote:
> >
> > > If there are any suggestion, can we take it in the JIRA. I feel, there
> is
> > > already JIRA for this.
> > >
> > >
> >
> https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070
> > >
> > >
> > > On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <
> > stefan@activitystream.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Directory pruning is great. It allows us, for example, to do
> efficient
> > > > date-range queries even when our data is arranged in a day or week
> > based
> > > > directory structure.
> > > >
> > > > We would like to be able to run the same query for all this data even
> > > > though the schema has changes slightly (new fields added) over time.
> > > >
> > > > For me there are two thing in this scenario that are unreasonable:
> > > >
> > > >    1. For Drill to have to get the schema for all possible files
> (union
> > > >    based) to validate queries
> > > >    - adding 100s of *irrelevant* files to the mix
> > > >
> > > >    2. For Drill to fail the query because a field is not found in the
> > > >    sub-set (directory pruned sub-set)
> > > >
> > > > The current approach results in option 2 and the proposed solution
> > > results
> > > > in option 1 (As I understand it)
> > > >
> > > > We would be perfectly happy with unknown fields resulting in null as
> > > there
> > > > are many ways to deal with null values built into Drill.
> > > >
> > > > Hopefully this a) makes sense and b) is acceptable.
> > > >
> > > > Enforcing a strict schema for Avro could be an optional feature
> (IMO).
> > > >
> > > > Regards,
> > > >   -Stefán
> > > >
> > > > On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <ja...@dremio.com>
> > > > wrote:
> > > >
> > > > > I think the main problem your hitting is that we should do a union
> of
> > > all
> > > > > files. In that case, as long as the field is in a single file,
> we're
> > > > going
> > > > > to let the field through.
> > > > >
> > > > > There is a balancing between early termination and flexibility that
> > we
> > > > must
> > > > > provide. If someone types a field and it is guaranteed to not be in
> > the
> > > > > data, the thinking is we should fail the query early as that is
> > > probably
> > > > a
> > > > > mistake on the user's part.  If it could be a valid field, we
> proceed
> > > > with
> > > > > execution and null it out until we find something.  That is the
> goal
> > > > > anyway. Clearly we have a bug here as we should never deny a
> possible
> > > or
> > > > > known field.
> > > > >
> > > > > I think of fields in three categories: known, possible, impossible.
> > > > > Impossible fields should fail to validate. Possible and known
> fields
> > > > should
> > > > > validate and execute.
> > > > >
> > > > > With regards to Ted's concern: I agree that applying a filter
> > shouldn't
> > > > > fail a query. That means we will either have to consider the
> complete
> > > > union
> > > > > Schema before pruning files or consider all fields as either known
> or
> > > > > possible after pruning files.
> > > > >
> > > > > Stefan, if you haven't already, please open a bug that known fields
> > are
> > > > > failing to validate in Avro and we will fix shortly. Sorry about
> the
> > > bug.
> > > > > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <
> stefan@activitystream.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Well, at least I'm not alone here.
> > > > > >
> > > > > > I think it must be time to set some ground rules for these things
> > and
> > > > > what
> > > > > > it means to support evolving schema and what is needed to
> eliminate
> > > > ETL.
> > > > > >
> > > > > > I trust that enforcing a strict schema "just because we think we
> > can"
> > > > > must
> > > > > > go against the principles of such rules.
> > > > > >
> > > > > > We moved all our stuff to Avro to avoid various problems with
> type
> > > > > handling
> > > > > > (assuming Double on nulls etc.) and to be hit with this, after
> all
> > > that
> > > > > > work, is like a slap in the face with two pilchards (more here:
> > > > > > https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > > > > >
> > > > > > Regards,
> > > > > >  -Stefán
> > > > > >
> > > > > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <
> > ted.dunning@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Sigh of relief is premature.  Nobody has committed to carrying
> > this
> > > > > > > interpretation forward.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <
> > > > > > stefan@activitystream.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > /me sighs of relief
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <
> > > > ted.dunning@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Actually, even without multiple storage types, this could
> be
> > > > > > radically
> > > > > > > > > confusing.
> > > > > > > > >
> > > > > > > > > If I have many avro files that are partitioned into
> > > directories,
> > > > > then
> > > > > > > > > queries that use the partitioning to limit the files that I
> > see
> > > > > could
> > > > > > > > > include or exclude more recent files that have added a new
> > > field.
> > > > > > > > >
> > > > > > > > > That means that a query would succeed or fail according to
> > > which
> > > > > date
> > > > > > > > range
> > > > > > > > > I use for the query.
> > > > > > > > >
> > > > > > > > > That seems pretty radically bad.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <
> > > > > > > > stefan@activitystream.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > This simply can not be the desired behavior!
> > > > > > > > > >
> > > > > > > > > > This prevents from using a field from a changing schema
> > with
> > > > dir0
> > > > > > > > > > sub-selection (directory pruning) as the altered/full
> > schema
> > > is
> > > > > > never
> > > > > > > > > part
> > > > > > > > > > of the query and it subsequently fails.
> > > > > > > > > >
> > > > > > > > > > Drill should, IMOP, never have rules that are dependent
> on
> > > the
> > > > > > > > underlying
> > > > > > > > > > storage type. If the query runs with JSON and Parquet
> then
> > it
> > > > > > should
> > > > > > > > work
> > > > > > > > > > for Avro as well.
> > > > > > > > > >
> > > > > > > > > > I'm hoping this strict schema validation is all just a
> > > > > > > > misunderstanding.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > >  -Stefán
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <
> > > > kamesh.hadoop@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > For Avro files, we first construct the schema, and this
> > > > schema
> > > > > is
> > > > > > > > used
> > > > > > > > > > for
> > > > > > > > > > > validating queries. So, if there are any errors in the
> > > query
> > > > > > (like
> > > > > > > > the
> > > > > > > > > > > invalid field references) it will fail fast. As of now,
> > for
> > > > > other
> > > > > > > > file
> > > > > > > > > > > formats, query validation (checking  for invalid field
> > > > > reference)
> > > > > > > > does
> > > > > > > > > > not
> > > > > > > > > > > happen, and at run time, it constructs the schema for
> > them
> > > > and
> > > > > > > hence
> > > > > > > > > > nulls
> > > > > > > > > > > for invalid fields.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > > > > > > > > > stefan@activitystream.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > I'm getting the following error when querying Avro
> > files:
> > > > > > > > > > > >
> > > > > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to
> line
> > > 1,
> > > > > > column
> > > > > > > > 57:
> > > > > > > > > > > > Column 'some_col' not found in any table
> > > > > > > > > > > >
> > > > > > > > > > > > It's true that the field is in none of the tables I'm
> > > > > > targeting,
> > > > > > > in
> > > > > > > > > > that
> > > > > > > > > > > > particular query, but that does not mean that it is
> in
> > > none
> > > > > of
> > > > > > > the
> > > > > > > > > > > possible
> > > > > > > > > > > > files I could be querying.
> > > > > > > > > > > >
> > > > > > > > > > > > We use Avro to get the benefits of the schema but I
> > never
> > > > > > > expected
> > > > > > > > > > Drill
> > > > > > > > > > > to
> > > > > > > > > > > > enforce it this way.
> > > > > > > > > > > >
> > > > > > > > > > > > Why do unresolved  columns not return null?
> > > > > > > > > > > >
> > > > > > > > > > > > This makes no sense to me as I think a fundamental
> > trade
> > > of
> > > > > > > Drill,
> > > > > > > > > when
> > > > > > > > > > > > trying to eliminate ETL, is to return null for any
> > > missing
> > > > > > > fields.
> > > > > > > > > > > >
> > > > > > > > > > > > Please advise.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > >  -Stefán
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Kamesh.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Kamesh.
> > >
> >
>
>
>
> --
> Kamesh.
>

Re: Avro - Schema is good - Schema validation is bad

Posted by Kamesh <ka...@gmail.com>.
For doing schema validation for Avro files, we are thinking of doing the
following steps.Please provide your suggestions, before we implement this.

   - Schema validation feature will be configurable and user can provide
   whether he/she wants this feature or not by setting a configuration
   property like PigAvroStorage
   <https://cwiki.apache.org/confluence/display/PIG/AvroStorage>.
   - If the schema validation flag is set, then we can consider the union
   schema of all the files in a directory recursively.


On Fri, Dec 18, 2015 at 9:17 AM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi Kamesh,
>
> This is, strictly speaking, not the same issue even though they have in
> common the fact that the Avro schema validation aspect.
>
> Regards,
>  -Stefán
>
> On Fri, Dec 18, 2015 at 2:17 AM, Kamesh <ka...@gmail.com> wrote:
>
> > If there are any suggestion, can we take it in the JIRA. I feel, there is
> > already JIRA for this.
> >
> >
> https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070
> >
> >
> > On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <
> stefan@activitystream.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Directory pruning is great. It allows us, for example, to do efficient
> > > date-range queries even when our data is arranged in a day or week
> based
> > > directory structure.
> > >
> > > We would like to be able to run the same query for all this data even
> > > though the schema has changes slightly (new fields added) over time.
> > >
> > > For me there are two thing in this scenario that are unreasonable:
> > >
> > >    1. For Drill to have to get the schema for all possible files (union
> > >    based) to validate queries
> > >    - adding 100s of *irrelevant* files to the mix
> > >
> > >    2. For Drill to fail the query because a field is not found in the
> > >    sub-set (directory pruned sub-set)
> > >
> > > The current approach results in option 2 and the proposed solution
> > results
> > > in option 1 (As I understand it)
> > >
> > > We would be perfectly happy with unknown fields resulting in null as
> > there
> > > are many ways to deal with null values built into Drill.
> > >
> > > Hopefully this a) makes sense and b) is acceptable.
> > >
> > > Enforcing a strict schema for Avro could be an optional feature (IMO).
> > >
> > > Regards,
> > >   -Stefán
> > >
> > > On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <ja...@dremio.com>
> > > wrote:
> > >
> > > > I think the main problem your hitting is that we should do a union of
> > all
> > > > files. In that case, as long as the field is in a single file, we're
> > > going
> > > > to let the field through.
> > > >
> > > > There is a balancing between early termination and flexibility that
> we
> > > must
> > > > provide. If someone types a field and it is guaranteed to not be in
> the
> > > > data, the thinking is we should fail the query early as that is
> > probably
> > > a
> > > > mistake on the user's part.  If it could be a valid field, we proceed
> > > with
> > > > execution and null it out until we find something.  That is the goal
> > > > anyway. Clearly we have a bug here as we should never deny a possible
> > or
> > > > known field.
> > > >
> > > > I think of fields in three categories: known, possible, impossible.
> > > > Impossible fields should fail to validate. Possible and known fields
> > > should
> > > > validate and execute.
> > > >
> > > > With regards to Ted's concern: I agree that applying a filter
> shouldn't
> > > > fail a query. That means we will either have to consider the complete
> > > union
> > > > Schema before pruning files or consider all fields as either known or
> > > > possible after pruning files.
> > > >
> > > > Stefan, if you haven't already, please open a bug that known fields
> are
> > > > failing to validate in Avro and we will fix shortly. Sorry about the
> > bug.
> > > > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <stefan@activitystream.com
> >
> > > > wrote:
> > > >
> > > > > Well, at least I'm not alone here.
> > > > >
> > > > > I think it must be time to set some ground rules for these things
> and
> > > > what
> > > > > it means to support evolving schema and what is needed to eliminate
> > > ETL.
> > > > >
> > > > > I trust that enforcing a strict schema "just because we think we
> can"
> > > > must
> > > > > go against the principles of such rules.
> > > > >
> > > > > We moved all our stuff to Avro to avoid various problems with type
> > > > handling
> > > > > (assuming Double on nulls etc.) and to be hit with this, after all
> > that
> > > > > work, is like a slap in the face with two pilchards (more here:
> > > > > https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > > > >
> > > > > Regards,
> > > > >  -Stefán
> > > > >
> > > > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <
> ted.dunning@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Sigh of relief is premature.  Nobody has committed to carrying
> this
> > > > > > interpretation forward.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <
> > > > > stefan@activitystream.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > /me sighs of relief
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <
> > > ted.dunning@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Actually, even without multiple storage types, this could be
> > > > > radically
> > > > > > > > confusing.
> > > > > > > >
> > > > > > > > If I have many avro files that are partitioned into
> > directories,
> > > > then
> > > > > > > > queries that use the partitioning to limit the files that I
> see
> > > > could
> > > > > > > > include or exclude more recent files that have added a new
> > field.
> > > > > > > >
> > > > > > > > That means that a query would succeed or fail according to
> > which
> > > > date
> > > > > > > range
> > > > > > > > I use for the query.
> > > > > > > >
> > > > > > > > That seems pretty radically bad.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <
> > > > > > > stefan@activitystream.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > This simply can not be the desired behavior!
> > > > > > > > >
> > > > > > > > > This prevents from using a field from a changing schema
> with
> > > dir0
> > > > > > > > > sub-selection (directory pruning) as the altered/full
> schema
> > is
> > > > > never
> > > > > > > > part
> > > > > > > > > of the query and it subsequently fails.
> > > > > > > > >
> > > > > > > > > Drill should, IMOP, never have rules that are dependent on
> > the
> > > > > > > underlying
> > > > > > > > > storage type. If the query runs with JSON and Parquet then
> it
> > > > > should
> > > > > > > work
> > > > > > > > > for Avro as well.
> > > > > > > > >
> > > > > > > > > I'm hoping this strict schema validation is all just a
> > > > > > > misunderstanding.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >  -Stefán
> > > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <
> > > kamesh.hadoop@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > For Avro files, we first construct the schema, and this
> > > schema
> > > > is
> > > > > > > used
> > > > > > > > > for
> > > > > > > > > > validating queries. So, if there are any errors in the
> > query
> > > > > (like
> > > > > > > the
> > > > > > > > > > invalid field references) it will fail fast. As of now,
> for
> > > > other
> > > > > > > file
> > > > > > > > > > formats, query validation (checking  for invalid field
> > > > reference)
> > > > > > > does
> > > > > > > > > not
> > > > > > > > > > happen, and at run time, it constructs the schema for
> them
> > > and
> > > > > > hence
> > > > > > > > > nulls
> > > > > > > > > > for invalid fields.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > > > > > > > > stefan@activitystream.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I'm getting the following error when querying Avro
> files:
> > > > > > > > > > >
> > > > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line
> > 1,
> > > > > column
> > > > > > > 57:
> > > > > > > > > > > Column 'some_col' not found in any table
> > > > > > > > > > >
> > > > > > > > > > > It's true that the field is in none of the tables I'm
> > > > > targeting,
> > > > > > in
> > > > > > > > > that
> > > > > > > > > > > particular query, but that does not mean that it is in
> > none
> > > > of
> > > > > > the
> > > > > > > > > > possible
> > > > > > > > > > > files I could be querying.
> > > > > > > > > > >
> > > > > > > > > > > We use Avro to get the benefits of the schema but I
> never
> > > > > > expected
> > > > > > > > > Drill
> > > > > > > > > > to
> > > > > > > > > > > enforce it this way.
> > > > > > > > > > >
> > > > > > > > > > > Why do unresolved  columns not return null?
> > > > > > > > > > >
> > > > > > > > > > > This makes no sense to me as I think a fundamental
> trade
> > of
> > > > > > Drill,
> > > > > > > > when
> > > > > > > > > > > trying to eliminate ETL, is to return null for any
> > missing
> > > > > > fields.
> > > > > > > > > > >
> > > > > > > > > > > Please advise.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > >  -Stefán
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Kamesh.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Kamesh.
> >
>



-- 
Kamesh.

Re: Avro - Schema is good - Schema validation is bad

Posted by Stefán Baxter <st...@activitystream.com>.
Hi Kamesh,

This is, strictly speaking, not the same issue even though they have in
common the fact that the Avro schema validation aspect.

Regards,
 -Stefán

On Fri, Dec 18, 2015 at 2:17 AM, Kamesh <ka...@gmail.com> wrote:

> If there are any suggestion, can we take it in the JIRA. I feel, there is
> already JIRA for this.
>
> https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070
>
>
> On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > Directory pruning is great. It allows us, for example, to do efficient
> > date-range queries even when our data is arranged in a day or week based
> > directory structure.
> >
> > We would like to be able to run the same query for all this data even
> > though the schema has changes slightly (new fields added) over time.
> >
> > For me there are two thing in this scenario that are unreasonable:
> >
> >    1. For Drill to have to get the schema for all possible files (union
> >    based) to validate queries
> >    - adding 100s of *irrelevant* files to the mix
> >
> >    2. For Drill to fail the query because a field is not found in the
> >    sub-set (directory pruned sub-set)
> >
> > The current approach results in option 2 and the proposed solution
> results
> > in option 1 (As I understand it)
> >
> > We would be perfectly happy with unknown fields resulting in null as
> there
> > are many ways to deal with null values built into Drill.
> >
> > Hopefully this a) makes sense and b) is acceptable.
> >
> > Enforcing a strict schema for Avro could be an optional feature (IMO).
> >
> > Regards,
> >   -Stefán
> >
> > On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <ja...@dremio.com>
> > wrote:
> >
> > > I think the main problem your hitting is that we should do a union of
> all
> > > files. In that case, as long as the field is in a single file, we're
> > going
> > > to let the field through.
> > >
> > > There is a balancing between early termination and flexibility that we
> > must
> > > provide. If someone types a field and it is guaranteed to not be in the
> > > data, the thinking is we should fail the query early as that is
> probably
> > a
> > > mistake on the user's part.  If it could be a valid field, we proceed
> > with
> > > execution and null it out until we find something.  That is the goal
> > > anyway. Clearly we have a bug here as we should never deny a possible
> or
> > > known field.
> > >
> > > I think of fields in three categories: known, possible, impossible.
> > > Impossible fields should fail to validate. Possible and known fields
> > should
> > > validate and execute.
> > >
> > > With regards to Ted's concern: I agree that applying a filter shouldn't
> > > fail a query. That means we will either have to consider the complete
> > union
> > > Schema before pruning files or consider all fields as either known or
> > > possible after pruning files.
> > >
> > > Stefan, if you haven't already, please open a bug that known fields are
> > > failing to validate in Avro and we will fix shortly. Sorry about the
> bug.
> > > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <st...@activitystream.com>
> > > wrote:
> > >
> > > > Well, at least I'm not alone here.
> > > >
> > > > I think it must be time to set some ground rules for these things and
> > > what
> > > > it means to support evolving schema and what is needed to eliminate
> > ETL.
> > > >
> > > > I trust that enforcing a strict schema "just because we think we can"
> > > must
> > > > go against the principles of such rules.
> > > >
> > > > We moved all our stuff to Avro to avoid various problems with type
> > > handling
> > > > (assuming Double on nulls etc.) and to be hit with this, after all
> that
> > > > work, is like a slap in the face with two pilchards (more here:
> > > > https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > > >
> > > > Regards,
> > > >  -Stefán
> > > >
> > > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > >
> > > > > Sigh of relief is premature.  Nobody has committed to carrying this
> > > > > interpretation forward.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <
> > > > stefan@activitystream.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > /me sighs of relief
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <
> > ted.dunning@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Actually, even without multiple storage types, this could be
> > > > radically
> > > > > > > confusing.
> > > > > > >
> > > > > > > If I have many avro files that are partitioned into
> directories,
> > > then
> > > > > > > queries that use the partitioning to limit the files that I see
> > > could
> > > > > > > include or exclude more recent files that have added a new
> field.
> > > > > > >
> > > > > > > That means that a query would succeed or fail according to
> which
> > > date
> > > > > > range
> > > > > > > I use for the query.
> > > > > > >
> > > > > > > That seems pretty radically bad.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <
> > > > > > stefan@activitystream.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > This simply can not be the desired behavior!
> > > > > > > >
> > > > > > > > This prevents from using a field from a changing schema with
> > dir0
> > > > > > > > sub-selection (directory pruning) as the altered/full schema
> is
> > > > never
> > > > > > > part
> > > > > > > > of the query and it subsequently fails.
> > > > > > > >
> > > > > > > > Drill should, IMOP, never have rules that are dependent on
> the
> > > > > > underlying
> > > > > > > > storage type. If the query runs with JSON and Parquet then it
> > > > should
> > > > > > work
> > > > > > > > for Avro as well.
> > > > > > > >
> > > > > > > > I'm hoping this strict schema validation is all just a
> > > > > > misunderstanding.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >  -Stefán
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <
> > kamesh.hadoop@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > For Avro files, we first construct the schema, and this
> > schema
> > > is
> > > > > > used
> > > > > > > > for
> > > > > > > > > validating queries. So, if there are any errors in the
> query
> > > > (like
> > > > > > the
> > > > > > > > > invalid field references) it will fail fast. As of now, for
> > > other
> > > > > > file
> > > > > > > > > formats, query validation (checking  for invalid field
> > > reference)
> > > > > > does
> > > > > > > > not
> > > > > > > > > happen, and at run time, it constructs the schema for them
> > and
> > > > > hence
> > > > > > > > nulls
> > > > > > > > > for invalid fields.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > > > > > > > stefan@activitystream.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I'm getting the following error when querying Avro files:
> > > > > > > > > >
> > > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line
> 1,
> > > > column
> > > > > > 57:
> > > > > > > > > > Column 'some_col' not found in any table
> > > > > > > > > >
> > > > > > > > > > It's true that the field is in none of the tables I'm
> > > > targeting,
> > > > > in
> > > > > > > > that
> > > > > > > > > > particular query, but that does not mean that it is in
> none
> > > of
> > > > > the
> > > > > > > > > possible
> > > > > > > > > > files I could be querying.
> > > > > > > > > >
> > > > > > > > > > We use Avro to get the benefits of the schema but I never
> > > > > expected
> > > > > > > > Drill
> > > > > > > > > to
> > > > > > > > > > enforce it this way.
> > > > > > > > > >
> > > > > > > > > > Why do unresolved  columns not return null?
> > > > > > > > > >
> > > > > > > > > > This makes no sense to me as I think a fundamental trade
> of
> > > > > Drill,
> > > > > > > when
> > > > > > > > > > trying to eliminate ETL, is to return null for any
> missing
> > > > > fields.
> > > > > > > > > >
> > > > > > > > > > Please advise.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > >  -Stefán
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Kamesh.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Kamesh.
>

Re: Avro - Schema is good - Schema validation is bad

Posted by Kamesh <ka...@gmail.com>.
If there are any suggestion, can we take it in the JIRA. I feel, there is
already JIRA for this.
https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070


On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi,
>
> Directory pruning is great. It allows us, for example, to do efficient
> date-range queries even when our data is arranged in a day or week based
> directory structure.
>
> We would like to be able to run the same query for all this data even
> though the schema has changes slightly (new fields added) over time.
>
> For me there are two thing in this scenario that are unreasonable:
>
>    1. For Drill to have to get the schema for all possible files (union
>    based) to validate queries
>    - adding 100s of *irrelevant* files to the mix
>
>    2. For Drill to fail the query because a field is not found in the
>    sub-set (directory pruned sub-set)
>
> The current approach results in option 2 and the proposed solution results
> in option 1 (As I understand it)
>
> We would be perfectly happy with unknown fields resulting in null as there
> are many ways to deal with null values built into Drill.
>
> Hopefully this a) makes sense and b) is acceptable.
>
> Enforcing a strict schema for Avro could be an optional feature (IMO).
>
> Regards,
>   -Stefán
>
> On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <ja...@dremio.com>
> wrote:
>
> > I think the main problem your hitting is that we should do a union of all
> > files. In that case, as long as the field is in a single file, we're
> going
> > to let the field through.
> >
> > There is a balancing between early termination and flexibility that we
> must
> > provide. If someone types a field and it is guaranteed to not be in the
> > data, the thinking is we should fail the query early as that is probably
> a
> > mistake on the user's part.  If it could be a valid field, we proceed
> with
> > execution and null it out until we find something.  That is the goal
> > anyway. Clearly we have a bug here as we should never deny a possible or
> > known field.
> >
> > I think of fields in three categories: known, possible, impossible.
> > Impossible fields should fail to validate. Possible and known fields
> should
> > validate and execute.
> >
> > With regards to Ted's concern: I agree that applying a filter shouldn't
> > fail a query. That means we will either have to consider the complete
> union
> > Schema before pruning files or consider all fields as either known or
> > possible after pruning files.
> >
> > Stefan, if you haven't already, please open a bug that known fields are
> > failing to validate in Avro and we will fix shortly. Sorry about the bug.
> > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <st...@activitystream.com>
> > wrote:
> >
> > > Well, at least I'm not alone here.
> > >
> > > I think it must be time to set some ground rules for these things and
> > what
> > > it means to support evolving schema and what is needed to eliminate
> ETL.
> > >
> > > I trust that enforcing a strict schema "just because we think we can"
> > must
> > > go against the principles of such rules.
> > >
> > > We moved all our stuff to Avro to avoid various problems with type
> > handling
> > > (assuming Double on nulls etc.) and to be hit with this, after all that
> > > work, is like a slap in the face with two pilchards (more here:
> > > https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > >
> > > Regards,
> > >  -Stefán
> > >
> > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > Sigh of relief is premature.  Nobody has committed to carrying this
> > > > interpretation forward.
> > > >
> > > >
> > > >
> > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <
> > > stefan@activitystream.com
> > > > >
> > > > wrote:
> > > >
> > > > > /me sighs of relief
> > > > >
> > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <
> ted.dunning@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Actually, even without multiple storage types, this could be
> > > radically
> > > > > > confusing.
> > > > > >
> > > > > > If I have many avro files that are partitioned into directories,
> > then
> > > > > > queries that use the partitioning to limit the files that I see
> > could
> > > > > > include or exclude more recent files that have added a new field.
> > > > > >
> > > > > > That means that a query would succeed or fail according to which
> > date
> > > > > range
> > > > > > I use for the query.
> > > > > >
> > > > > > That seems pretty radically bad.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <
> > > > > stefan@activitystream.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > This simply can not be the desired behavior!
> > > > > > >
> > > > > > > This prevents from using a field from a changing schema with
> dir0
> > > > > > > sub-selection (directory pruning) as the altered/full schema is
> > > never
> > > > > > part
> > > > > > > of the query and it subsequently fails.
> > > > > > >
> > > > > > > Drill should, IMOP, never have rules that are dependent on the
> > > > > underlying
> > > > > > > storage type. If the query runs with JSON and Parquet then it
> > > should
> > > > > work
> > > > > > > for Avro as well.
> > > > > > >
> > > > > > > I'm hoping this strict schema validation is all just a
> > > > > misunderstanding.
> > > > > > >
> > > > > > > Regards,
> > > > > > >  -Stefán
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <
> kamesh.hadoop@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > For Avro files, we first construct the schema, and this
> schema
> > is
> > > > > used
> > > > > > > for
> > > > > > > > validating queries. So, if there are any errors in the query
> > > (like
> > > > > the
> > > > > > > > invalid field references) it will fail fast. As of now, for
> > other
> > > > > file
> > > > > > > > formats, query validation (checking  for invalid field
> > reference)
> > > > > does
> > > > > > > not
> > > > > > > > happen, and at run time, it constructs the schema for them
> and
> > > > hence
> > > > > > > nulls
> > > > > > > > for invalid fields.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > > > > > > stefan@activitystream.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I'm getting the following error when querying Avro files:
> > > > > > > > >
> > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1,
> > > column
> > > > > 57:
> > > > > > > > > Column 'some_col' not found in any table
> > > > > > > > >
> > > > > > > > > It's true that the field is in none of the tables I'm
> > > targeting,
> > > > in
> > > > > > > that
> > > > > > > > > particular query, but that does not mean that it is in none
> > of
> > > > the
> > > > > > > > possible
> > > > > > > > > files I could be querying.
> > > > > > > > >
> > > > > > > > > We use Avro to get the benefits of the schema but I never
> > > > expected
> > > > > > > Drill
> > > > > > > > to
> > > > > > > > > enforce it this way.
> > > > > > > > >
> > > > > > > > > Why do unresolved  columns not return null?
> > > > > > > > >
> > > > > > > > > This makes no sense to me as I think a fundamental trade of
> > > > Drill,
> > > > > > when
> > > > > > > > > trying to eliminate ETL, is to return null for any missing
> > > > fields.
> > > > > > > > >
> > > > > > > > > Please advise.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >  -Stefán
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Kamesh.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 
Kamesh.

Re: Avro - Schema is good - Schema validation is bad

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

Directory pruning is great. It allows us, for example, to do efficient
date-range queries even when our data is arranged in a day or week based
directory structure.

We would like to be able to run the same query for all this data even
though the schema has changes slightly (new fields added) over time.

For me there are two thing in this scenario that are unreasonable:

   1. For Drill to have to get the schema for all possible files (union
   based) to validate queries
   - adding 100s of *irrelevant* files to the mix

   2. For Drill to fail the query because a field is not found in the
   sub-set (directory pruned sub-set)

The current approach results in option 2 and the proposed solution results
in option 1 (As I understand it)

We would be perfectly happy with unknown fields resulting in null as there
are many ways to deal with null values built into Drill.

Hopefully this a) makes sense and b) is acceptable.

Enforcing a strict schema for Avro could be an optional feature (IMO).

Regards,
  -Stefán

On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <ja...@dremio.com> wrote:

> I think the main problem your hitting is that we should do a union of all
> files. In that case, as long as the field is in a single file, we're going
> to let the field through.
>
> There is a balancing between early termination and flexibility that we must
> provide. If someone types a field and it is guaranteed to not be in the
> data, the thinking is we should fail the query early as that is probably a
> mistake on the user's part.  If it could be a valid field, we proceed with
> execution and null it out until we find something.  That is the goal
> anyway. Clearly we have a bug here as we should never deny a possible or
> known field.
>
> I think of fields in three categories: known, possible, impossible.
> Impossible fields should fail to validate. Possible and known fields should
> validate and execute.
>
> With regards to Ted's concern: I agree that applying a filter shouldn't
> fail a query. That means we will either have to consider the complete union
> Schema before pruning files or consider all fields as either known or
> possible after pruning files.
>
> Stefan, if you haven't already, please open a bug that known fields are
> failing to validate in Avro and we will fix shortly. Sorry about the bug.
> On Dec 14, 2015 10:51 PM, "Stefán Baxter" <st...@activitystream.com>
> wrote:
>
> > Well, at least I'm not alone here.
> >
> > I think it must be time to set some ground rules for these things and
> what
> > it means to support evolving schema and what is needed to eliminate ETL.
> >
> > I trust that enforcing a strict schema "just because we think we can"
> must
> > go against the principles of such rules.
> >
> > We moved all our stuff to Avro to avoid various problems with type
> handling
> > (assuming Double on nulls etc.) and to be hit with this, after all that
> > work, is like a slap in the face with two pilchards (more here:
> > https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> >
> > Regards,
> >  -Stefán
> >
> > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Sigh of relief is premature.  Nobody has committed to carrying this
> > > interpretation forward.
> > >
> > >
> > >
> > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <
> > stefan@activitystream.com
> > > >
> > > wrote:
> > >
> > > > /me sighs of relief
> > > >
> > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > >
> > > > > Actually, even without multiple storage types, this could be
> > radically
> > > > > confusing.
> > > > >
> > > > > If I have many avro files that are partitioned into directories,
> then
> > > > > queries that use the partitioning to limit the files that I see
> could
> > > > > include or exclude more recent files that have added a new field.
> > > > >
> > > > > That means that a query would succeed or fail according to which
> date
> > > > range
> > > > > I use for the query.
> > > > >
> > > > > That seems pretty radically bad.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <
> > > > stefan@activitystream.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > This simply can not be the desired behavior!
> > > > > >
> > > > > > This prevents from using a field from a changing schema with dir0
> > > > > > sub-selection (directory pruning) as the altered/full schema is
> > never
> > > > > part
> > > > > > of the query and it subsequently fails.
> > > > > >
> > > > > > Drill should, IMOP, never have rules that are dependent on the
> > > > underlying
> > > > > > storage type. If the query runs with JSON and Parquet then it
> > should
> > > > work
> > > > > > for Avro as well.
> > > > > >
> > > > > > I'm hoping this strict schema validation is all just a
> > > > misunderstanding.
> > > > > >
> > > > > > Regards,
> > > > > >  -Stefán
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <kamesh.hadoop@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > For Avro files, we first construct the schema, and this schema
> is
> > > > used
> > > > > > for
> > > > > > > validating queries. So, if there are any errors in the query
> > (like
> > > > the
> > > > > > > invalid field references) it will fail fast. As of now, for
> other
> > > > file
> > > > > > > formats, query validation (checking  for invalid field
> reference)
> > > > does
> > > > > > not
> > > > > > > happen, and at run time, it constructs the schema for them and
> > > hence
> > > > > > nulls
> > > > > > > for invalid fields.
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > > > > > stefan@activitystream.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I'm getting the following error when querying Avro files:
> > > > > > > >
> > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1,
> > column
> > > > 57:
> > > > > > > > Column 'some_col' not found in any table
> > > > > > > >
> > > > > > > > It's true that the field is in none of the tables I'm
> > targeting,
> > > in
> > > > > > that
> > > > > > > > particular query, but that does not mean that it is in none
> of
> > > the
> > > > > > > possible
> > > > > > > > files I could be querying.
> > > > > > > >
> > > > > > > > We use Avro to get the benefits of the schema but I never
> > > expected
> > > > > > Drill
> > > > > > > to
> > > > > > > > enforce it this way.
> > > > > > > >
> > > > > > > > Why do unresolved  columns not return null?
> > > > > > > >
> > > > > > > > This makes no sense to me as I think a fundamental trade of
> > > Drill,
> > > > > when
> > > > > > > > trying to eliminate ETL, is to return null for any missing
> > > fields.
> > > > > > > >
> > > > > > > > Please advise.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >  -Stefán
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Kamesh.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Avro - Schema is good - Schema validation is bad

Posted by Jacques Nadeau <ja...@dremio.com>.
I think the main problem your hitting is that we should do a union of all
files. In that case, as long as the field is in a single file, we're going
to let the field through.

There is a balancing between early termination and flexibility that we must
provide. If someone types a field and it is guaranteed to not be in the
data, the thinking is we should fail the query early as that is probably a
mistake on the user's part.  If it could be a valid field, we proceed with
execution and null it out until we find something.  That is the goal
anyway. Clearly we have a bug here as we should never deny a possible or
known field.

I think of fields in three categories: known, possible, impossible.
Impossible fields should fail to validate. Possible and known fields should
validate and execute.

With regards to Ted's concern: I agree that applying a filter shouldn't
fail a query. That means we will either have to consider the complete union
Schema before pruning files or consider all fields as either known or
possible after pruning files.

Stefan, if you haven't already, please open a bug that known fields are
failing to validate in Avro and we will fix shortly. Sorry about the bug.
On Dec 14, 2015 10:51 PM, "Stefán Baxter" <st...@activitystream.com> wrote:

> Well, at least I'm not alone here.
>
> I think it must be time to set some ground rules for these things and what
> it means to support evolving schema and what is needed to eliminate ETL.
>
> I trust that enforcing a strict schema "just because we think we can" must
> go against the principles of such rules.
>
> We moved all our stuff to Avro to avoid various problems with type handling
> (assuming Double on nulls etc.) and to be hit with this, after all that
> work, is like a slap in the face with two pilchards (more here:
> https://www.youtube.com/watch?v=IhJQp-q1Y1s)
>
> Regards,
>  -Stefán
>
> On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Sigh of relief is premature.  Nobody has committed to carrying this
> > interpretation forward.
> >
> >
> >
> > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <
> stefan@activitystream.com
> > >
> > wrote:
> >
> > > /me sighs of relief
> > >
> > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > Actually, even without multiple storage types, this could be
> radically
> > > > confusing.
> > > >
> > > > If I have many avro files that are partitioned into directories, then
> > > > queries that use the partitioning to limit the files that I see could
> > > > include or exclude more recent files that have added a new field.
> > > >
> > > > That means that a query would succeed or fail according to which date
> > > range
> > > > I use for the query.
> > > >
> > > > That seems pretty radically bad.
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <
> > > stefan@activitystream.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > This simply can not be the desired behavior!
> > > > >
> > > > > This prevents from using a field from a changing schema with dir0
> > > > > sub-selection (directory pruning) as the altered/full schema is
> never
> > > > part
> > > > > of the query and it subsequently fails.
> > > > >
> > > > > Drill should, IMOP, never have rules that are dependent on the
> > > underlying
> > > > > storage type. If the query runs with JSON and Parquet then it
> should
> > > work
> > > > > for Avro as well.
> > > > >
> > > > > I'm hoping this strict schema validation is all just a
> > > misunderstanding.
> > > > >
> > > > > Regards,
> > > > >  -Stefán
> > > > >
> > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <ka...@gmail.com>
> > > wrote:
> > > > >
> > > > > > For Avro files, we first construct the schema, and this schema is
> > > used
> > > > > for
> > > > > > validating queries. So, if there are any errors in the query
> (like
> > > the
> > > > > > invalid field references) it will fail fast. As of now, for other
> > > file
> > > > > > formats, query validation (checking  for invalid field reference)
> > > does
> > > > > not
> > > > > > happen, and at run time, it constructs the schema for them and
> > hence
> > > > > nulls
> > > > > > for invalid fields.
> > > > > >
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > > > > stefan@activitystream.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I'm getting the following error when querying Avro files:
> > > > > > >
> > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1,
> column
> > > 57:
> > > > > > > Column 'some_col' not found in any table
> > > > > > >
> > > > > > > It's true that the field is in none of the tables I'm
> targeting,
> > in
> > > > > that
> > > > > > > particular query, but that does not mean that it is in none of
> > the
> > > > > > possible
> > > > > > > files I could be querying.
> > > > > > >
> > > > > > > We use Avro to get the benefits of the schema but I never
> > expected
> > > > > Drill
> > > > > > to
> > > > > > > enforce it this way.
> > > > > > >
> > > > > > > Why do unresolved  columns not return null?
> > > > > > >
> > > > > > > This makes no sense to me as I think a fundamental trade of
> > Drill,
> > > > when
> > > > > > > trying to eliminate ETL, is to return null for any missing
> > fields.
> > > > > > >
> > > > > > > Please advise.
> > > > > > >
> > > > > > > Regards,
> > > > > > >  -Stefán
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Kamesh.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Avro - Schema is good - Schema validation is bad

Posted by Stefán Baxter <st...@activitystream.com>.
Well, at least I'm not alone here.

I think it must be time to set some ground rules for these things and what
it means to support evolving schema and what is needed to eliminate ETL.

I trust that enforcing a strict schema "just because we think we can" must
go against the principles of such rules.

We moved all our stuff to Avro to avoid various problems with type handling
(assuming Double on nulls etc.) and to be hit with this, after all that
work, is like a slap in the face with two pilchards (more here:
https://www.youtube.com/watch?v=IhJQp-q1Y1s)

Regards,
 -Stefán

On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <te...@gmail.com> wrote:

> Sigh of relief is premature.  Nobody has committed to carrying this
> interpretation forward.
>
>
>
> On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <stefan@activitystream.com
> >
> wrote:
>
> > /me sighs of relief
> >
> > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Actually, even without multiple storage types, this could be radically
> > > confusing.
> > >
> > > If I have many avro files that are partitioned into directories, then
> > > queries that use the partitioning to limit the files that I see could
> > > include or exclude more recent files that have added a new field.
> > >
> > > That means that a query would succeed or fail according to which date
> > range
> > > I use for the query.
> > >
> > > That seems pretty radically bad.
> > >
> > >
> > >
> > >
> > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <
> > stefan@activitystream.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > This simply can not be the desired behavior!
> > > >
> > > > This prevents from using a field from a changing schema with dir0
> > > > sub-selection (directory pruning) as the altered/full schema is never
> > > part
> > > > of the query and it subsequently fails.
> > > >
> > > > Drill should, IMOP, never have rules that are dependent on the
> > underlying
> > > > storage type. If the query runs with JSON and Parquet then it should
> > work
> > > > for Avro as well.
> > > >
> > > > I'm hoping this strict schema validation is all just a
> > misunderstanding.
> > > >
> > > > Regards,
> > > >  -Stefán
> > > >
> > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <ka...@gmail.com>
> > wrote:
> > > >
> > > > > For Avro files, we first construct the schema, and this schema is
> > used
> > > > for
> > > > > validating queries. So, if there are any errors in the query (like
> > the
> > > > > invalid field references) it will fail fast. As of now, for other
> > file
> > > > > formats, query validation (checking  for invalid field reference)
> > does
> > > > not
> > > > > happen, and at run time, it constructs the schema for them and
> hence
> > > > nulls
> > > > > for invalid fields.
> > > > >
> > > > >
> > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > > > stefan@activitystream.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm getting the following error when querying Avro files:
> > > > > >
> > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column
> > 57:
> > > > > > Column 'some_col' not found in any table
> > > > > >
> > > > > > It's true that the field is in none of the tables I'm targeting,
> in
> > > > that
> > > > > > particular query, but that does not mean that it is in none of
> the
> > > > > possible
> > > > > > files I could be querying.
> > > > > >
> > > > > > We use Avro to get the benefits of the schema but I never
> expected
> > > > Drill
> > > > > to
> > > > > > enforce it this way.
> > > > > >
> > > > > > Why do unresolved  columns not return null?
> > > > > >
> > > > > > This makes no sense to me as I think a fundamental trade of
> Drill,
> > > when
> > > > > > trying to eliminate ETL, is to return null for any missing
> fields.
> > > > > >
> > > > > > Please advise.
> > > > > >
> > > > > > Regards,
> > > > > >  -Stefán
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Kamesh.
> > > > >
> > > >
> > >
> >
>

Re: Avro - Schema is good - Schema validation is bad

Posted by Ted Dunning <te...@gmail.com>.
Sigh of relief is premature.  Nobody has committed to carrying this
interpretation forward.



On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <st...@activitystream.com>
wrote:

> /me sighs of relief
>
> On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Actually, even without multiple storage types, this could be radically
> > confusing.
> >
> > If I have many avro files that are partitioned into directories, then
> > queries that use the partitioning to limit the files that I see could
> > include or exclude more recent files that have added a new field.
> >
> > That means that a query would succeed or fail according to which date
> range
> > I use for the query.
> >
> > That seems pretty radically bad.
> >
> >
> >
> >
> > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <
> stefan@activitystream.com>
> > wrote:
> >
> > > Hi,
> > >
> > > This simply can not be the desired behavior!
> > >
> > > This prevents from using a field from a changing schema with dir0
> > > sub-selection (directory pruning) as the altered/full schema is never
> > part
> > > of the query and it subsequently fails.
> > >
> > > Drill should, IMOP, never have rules that are dependent on the
> underlying
> > > storage type. If the query runs with JSON and Parquet then it should
> work
> > > for Avro as well.
> > >
> > > I'm hoping this strict schema validation is all just a
> misunderstanding.
> > >
> > > Regards,
> > >  -Stefán
> > >
> > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <ka...@gmail.com>
> wrote:
> > >
> > > > For Avro files, we first construct the schema, and this schema is
> used
> > > for
> > > > validating queries. So, if there are any errors in the query (like
> the
> > > > invalid field references) it will fail fast. As of now, for other
> file
> > > > formats, query validation (checking  for invalid field reference)
> does
> > > not
> > > > happen, and at run time, it constructs the schema for them and hence
> > > nulls
> > > > for invalid fields.
> > > >
> > > >
> > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > > stefan@activitystream.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm getting the following error when querying Avro files:
> > > > >
> > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column
> 57:
> > > > > Column 'some_col' not found in any table
> > > > >
> > > > > It's true that the field is in none of the tables I'm targeting, in
> > > that
> > > > > particular query, but that does not mean that it is in none of the
> > > > possible
> > > > > files I could be querying.
> > > > >
> > > > > We use Avro to get the benefits of the schema but I never expected
> > > Drill
> > > > to
> > > > > enforce it this way.
> > > > >
> > > > > Why do unresolved  columns not return null?
> > > > >
> > > > > This makes no sense to me as I think a fundamental trade of Drill,
> > when
> > > > > trying to eliminate ETL, is to return null for any missing fields.
> > > > >
> > > > > Please advise.
> > > > >
> > > > > Regards,
> > > > >  -Stefán
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Kamesh.
> > > >
> > >
> >
>

Re: Avro - Schema is good - Schema validation is bad

Posted by Stefán Baxter <st...@activitystream.com>.
/me sighs of relief

On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <te...@gmail.com> wrote:

> Actually, even without multiple storage types, this could be radically
> confusing.
>
> If I have many avro files that are partitioned into directories, then
> queries that use the partitioning to limit the files that I see could
> include or exclude more recent files that have added a new field.
>
> That means that a query would succeed or fail according to which date range
> I use for the query.
>
> That seems pretty radically bad.
>
>
>
>
> On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > This simply can not be the desired behavior!
> >
> > This prevents from using a field from a changing schema with dir0
> > sub-selection (directory pruning) as the altered/full schema is never
> part
> > of the query and it subsequently fails.
> >
> > Drill should, IMOP, never have rules that are dependent on the underlying
> > storage type. If the query runs with JSON and Parquet then it should work
> > for Avro as well.
> >
> > I'm hoping this strict schema validation is all just a misunderstanding.
> >
> > Regards,
> >  -Stefán
> >
> > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <ka...@gmail.com> wrote:
> >
> > > For Avro files, we first construct the schema, and this schema is used
> > for
> > > validating queries. So, if there are any errors in the query (like the
> > > invalid field references) it will fail fast. As of now, for other file
> > > formats, query validation (checking  for invalid field reference) does
> > not
> > > happen, and at run time, it constructs the schema for them and hence
> > nulls
> > > for invalid fields.
> > >
> > >
> > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > stefan@activitystream.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm getting the following error when querying Avro files:
> > > >
> > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57:
> > > > Column 'some_col' not found in any table
> > > >
> > > > It's true that the field is in none of the tables I'm targeting, in
> > that
> > > > particular query, but that does not mean that it is in none of the
> > > possible
> > > > files I could be querying.
> > > >
> > > > We use Avro to get the benefits of the schema but I never expected
> > Drill
> > > to
> > > > enforce it this way.
> > > >
> > > > Why do unresolved  columns not return null?
> > > >
> > > > This makes no sense to me as I think a fundamental trade of Drill,
> when
> > > > trying to eliminate ETL, is to return null for any missing fields.
> > > >
> > > > Please advise.
> > > >
> > > > Regards,
> > > >  -Stefán
> > > >
> > >
> > >
> > >
> > > --
> > > Kamesh.
> > >
> >
>

Re: Avro - Schema is good - Schema validation is bad

Posted by Ted Dunning <te...@gmail.com>.
Actually, even without multiple storage types, this could be radically
confusing.

If I have many avro files that are partitioned into directories, then
queries that use the partitioning to limit the files that I see could
include or exclude more recent files that have added a new field.

That means that a query would succeed or fail according to which date range
I use for the query.

That seems pretty radically bad.




On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi,
>
> This simply can not be the desired behavior!
>
> This prevents from using a field from a changing schema with dir0
> sub-selection (directory pruning) as the altered/full schema is never part
> of the query and it subsequently fails.
>
> Drill should, IMOP, never have rules that are dependent on the underlying
> storage type. If the query runs with JSON and Parquet then it should work
> for Avro as well.
>
> I'm hoping this strict schema validation is all just a misunderstanding.
>
> Regards,
>  -Stefán
>
> On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <ka...@gmail.com> wrote:
>
> > For Avro files, we first construct the schema, and this schema is used
> for
> > validating queries. So, if there are any errors in the query (like the
> > invalid field references) it will fail fast. As of now, for other file
> > formats, query validation (checking  for invalid field reference) does
> not
> > happen, and at run time, it constructs the schema for them and hence
> nulls
> > for invalid fields.
> >
> >
> > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> stefan@activitystream.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I'm getting the following error when querying Avro files:
> > >
> > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57:
> > > Column 'some_col' not found in any table
> > >
> > > It's true that the field is in none of the tables I'm targeting, in
> that
> > > particular query, but that does not mean that it is in none of the
> > possible
> > > files I could be querying.
> > >
> > > We use Avro to get the benefits of the schema but I never expected
> Drill
> > to
> > > enforce it this way.
> > >
> > > Why do unresolved  columns not return null?
> > >
> > > This makes no sense to me as I think a fundamental trade of Drill, when
> > > trying to eliminate ETL, is to return null for any missing fields.
> > >
> > > Please advise.
> > >
> > > Regards,
> > >  -Stefán
> > >
> >
> >
> >
> > --
> > Kamesh.
> >
>

Re: Avro - Schema is good - Schema validation is bad

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

This simply can not be the desired behavior!

This prevents from using a field from a changing schema with dir0
sub-selection (directory pruning) as the altered/full schema is never part
of the query and it subsequently fails.

Drill should, IMOP, never have rules that are dependent on the underlying
storage type. If the query runs with JSON and Parquet then it should work
for Avro as well.

I'm hoping this strict schema validation is all just a misunderstanding.

Regards,
 -Stefán

On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <ka...@gmail.com> wrote:

> For Avro files, we first construct the schema, and this schema is used for
> validating queries. So, if there are any errors in the query (like the
> invalid field references) it will fail fast. As of now, for other file
> formats, query validation (checking  for invalid field reference) does not
> happen, and at run time, it constructs the schema for them and hence nulls
> for invalid fields.
>
>
> On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > I'm getting the following error when querying Avro files:
> >
> > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57:
> > Column 'some_col' not found in any table
> >
> > It's true that the field is in none of the tables I'm targeting, in that
> > particular query, but that does not mean that it is in none of the
> possible
> > files I could be querying.
> >
> > We use Avro to get the benefits of the schema but I never expected Drill
> to
> > enforce it this way.
> >
> > Why do unresolved  columns not return null?
> >
> > This makes no sense to me as I think a fundamental trade of Drill, when
> > trying to eliminate ETL, is to return null for any missing fields.
> >
> > Please advise.
> >
> > Regards,
> >  -Stefán
> >
>
>
>
> --
> Kamesh.
>

Re: Avro - Schema is good - Schema validation is bad

Posted by Kamesh <ka...@gmail.com>.
For Avro files, we first construct the schema, and this schema is used for
validating queries. So, if there are any errors in the query (like the
invalid field references) it will fail fast. As of now, for other file
formats, query validation (checking  for invalid field reference) does not
happen, and at run time, it constructs the schema for them and hence nulls
for invalid fields.


On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi,
>
> I'm getting the following error when querying Avro files:
>
> Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57:
> Column 'some_col' not found in any table
>
> It's true that the field is in none of the tables I'm targeting, in that
> particular query, but that does not mean that it is in none of the possible
> files I could be querying.
>
> We use Avro to get the benefits of the schema but I never expected Drill to
> enforce it this way.
>
> Why do unresolved  columns not return null?
>
> This makes no sense to me as I think a fundamental trade of Drill, when
> trying to eliminate ETL, is to return null for any missing fields.
>
> Please advise.
>
> Regards,
>  -Stefán
>



-- 
Kamesh.