
Posted to dev@drill.apache.org by Jacques Nadeau <ja...@dremio.com> on 2015/10/26 22:49:45 UTC

Request for more feedback on "Support the Ability to Identify And Skip Records" design

Hsuan was kind enough to put together a provocative discussion on the
mailing list about skipping records. I've started a way-too-long thread in
the comments discussion but would like to get other feedback from the
community. The main point of contention I have is that the big goal of this
design is to provide "data import"-like capabilities for Drill. In that
context, I suggested a scan-based approach to schema enforcement (and bad
record capture/storage). I think it is a simpler approach and solves the
vast majority of user needs. Hsuan's initial proposal was a much
broader-reaching proposal that supports an arbitrary number of expression
types within project and filter (assuming they are proximate to the scan).

Would love to get others' feedback and thoughts on the doc as to what the
MVP for this feature really is.

https://docs.google.com/document/d/1jCeYW924_SFwf-nOqtXrO68eixmAitM-tLngezzXw3Y/edit


--
Jacques Nadeau
CTO and Co-Founder, Dremio

Re: Request for more feedback on "Support the Ability to Identify And Skip Records" design

Posted by Neeraja Rentachintala <nr...@maprtech.com>.

Jacques,
Thanks for the details.
I am trying to understand the difference between 3 & 4.
Here is how I am thinking of the scenario. It's probably better to discuss
this in the hangout.

I have some data coming from an external system and I expect it to be in
a certain format. I checked the first couple of rows and they seem to stick
to the format. I have written a Drill query or a view to interpret the data
(e.g. converting certain fields to a date or timestamp, casting to a
specific type, etc.). However, certain records seem to be corrupted, such as
being prefixed with non-printable characters. I need special handling for
these records (e.g. I want to identify what they are so I can either
fix them or choose to skip them). I am still in the data exploration phase
at this point.

There is an extended use case of ETL/data import where I probably have
millions of text files coming in and I am using Drill to convert all of
them to Parquet using CTAS. Some of the records in these files could be
corrupted and I need special handling for these (potentially skipping them
or moving them to a separate file) without interrupting the whole data
conversion.
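
To make the ETL case concrete, here is a minimal Python sketch of the behavior being asked for: rows that fail conversion are diverted, with their provenance, instead of aborting the whole run. It is not Drill code. The schema, the `convert_file` function, and the sample rows are all hypothetical and only illustrate the idea.

```python
from datetime import datetime

# Hypothetical expected structure: (field name, converter) pairs.
SCHEMA = [
    ("id", int),
    ("created", lambda s: datetime.strptime(s, "%Y-%m-%d")),
]

def convert_file(name, lines, delimiter="|"):
    """Cast each field of each line; divert failing lines instead of aborting."""
    good, rejected = [], []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        try:
            if len(fields) != len(SCHEMA):
                raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
            good.append({col: cast(raw) for (col, cast), raw in zip(SCHEMA, fields)})
        except ValueError as err:
            # Keep provenance so the record can be inspected or fixed later.
            rejected.append(
                {"file": name, "line": line_no, "error": str(err), "raw": line}
            )
    return good, rejected

# The second row is prefixed with non-printable characters; the third has a bad date.
good, rejected = convert_file(
    "events.txt",
    ["1|2015-10-26", "\x00\x012|2015-10-27", "3|not-a-date"],
)
```

In this sketch the one clean row ends up in `good`, and the two corrupted rows land in `rejected` with their file name and line number, which is roughly the "move them to a separate file" option.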

-Neeraja

On Tue, Oct 27, 2015 at 8:52 AM, Jacques Nadeau <ja...@dremio.com> wrote:

> There seem to be multiple user requirements that are being considered in
> Hsuan's & Julien's proposals:
>
> 1. Drill doesn't have enough information to parse my data, I want to give
> Drill help. (Examples might be: the field delimiter is "|", the proto IDL
> encoding for a protobuf file is "...", provide an external Avro schema)
> 2. While Drill can parse my data, the structure output is incomplete. It
> may be missing field types and/or field names. I want to tell Drill how to
> interpret that data since the format itself doesn't provide an adequate way
> to express this (typically text files as opposed to JSON, Parquet)
> 3. I've defined an expected structure to my data files. If some records
> don't match that, I want to have special handling to manage those records
> (e.g. drop, warn number of drops, create separate file with provenance of
> each failing record)
> 4. I have an arbitrary query and I want any data-specific execution
> failures to be squelched to allow the query to complete with whatever data
> remains.
>
> My recommendation is that we have four new features:
>
> A. table with options (what Julien is working on)
> B. .drill files (https://issues.apache.org/jira/browse/DRILL-3572)
> C. alter table ascribe metadata (to create a .drill file through SQL)
> D. Support using table with options (A) to override settings in .drill (B)
>
> I believe that A & B (and C since it is simply a derivative of B) should
> provide the capability to achieve requirements 1-3 above.
>
> When Neeraja talks of the exploration use case, feature A is probably the
> most common way that people will do this. In the case of use case 3 above,
> if someone wants to use a "recordPositionAndError" behavior (see
> DRILL-3572), they will most likely want to do that in the context of a
> query (as opposed to a view or .drill).  As such, you would probably create
> a .drill file that did warn or ignore. Then layer over the top (via feature
> D) a recordPositionAndError if you want that for a certain situation.
>
> My main thought on Hsuan's initial proposal is it seems to try to provide
> an incomplete resolution of #4 above. It isn't clear to me that use case #4
> is a critical use case for most users. If it is, can we get some concrete
> examples of it as opposed to use cases 1-3? If it is a critical use case, I
> think we should solve it in a more general way (for example I don't think
> we should try to maintain file-based record provenance in that context).
> Among other things, the current proposal has the weird problem of not being
> consistent in how the user experiences the behavior (depending on what plan
> Drill decides to execute.)
>
> Note, there were some questions about how 1-3 could be solved using B so
> I've provided an example in the Jira:
> https://issues.apache.org/jira/browse/DRILL-3572
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio

Re: Request for more feedback on "Support the Ability to Identify And Skip Records" design

Posted by Jacques Nadeau <ja...@dremio.com>.

There seem to be multiple user requirements that are being considered in
Hsuan's & Julien's proposals:

1. Drill doesn't have enough information to parse my data, I want to give
Drill help. (Examples might be: the field delimiter is "|", the proto IDL
encoding for a protobuf file is "...", provide an external Avro schema)
2. While Drill can parse my data, the structure output is incomplete. It
may be missing field types and/or field names. I want to tell Drill how to
interpret that data since the format itself doesn't provide an adequate way
to express this (typically text files as opposed to JSON, Parquet)
3. I've defined an expected structure to my data files. If some records
don't match that, I want to have special handling to manage those records
(e.g. drop, warn number of drops, create separate file with provenance of
each failing record)
4. I have an arbitrary query and I want any data-specific execution
failures to be squelched to allow the query to complete with whatever data
remains.

My recommendation is that we have four new features:

A. table with options (what Julien is working on)
B. .drill files (https://issues.apache.org/jira/browse/DRILL-3572)
C. alter table ascribe metadata (to create a .drill file through SQL)
D. Support using table with options (A) to override settings in .drill (B)

I believe that A & B (and C since it is simply a derivative of B) should
provide the capability to achieve requirements 1-3 above.

When Neeraja talks of the exploration use case, feature A is probably the
most common way that people will do this. In the case of use case 3 above,
if someone wants to use a "recordPositionAndError" behavior (see
DRILL-3572), they will most likely want to do that in the context of a
query (as opposed to a view or .drill).  As such, you would probably create
a .drill file that did warn or ignore. Then layer over the top (via feature
D) a recordPositionAndError if you want that for a certain situation.
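
The layering described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the behavior names (warn, ignore, recordPositionAndError) come from this thread, but the option keys, the `effective_options` and `scan` functions, and the exact semantics of each behavior are hypothetical and are not taken from DRILL-3572.

```python
# Hypothetical stand-in for settings stored in a .drill file (feature B).
DOT_DRILL_DEFAULTS = {"delimiter": "|", "onBadRecord": "warn"}

def effective_options(dot_drill, query_options=None):
    """Feature D: per-query table options override the .drill settings."""
    merged = dict(dot_drill)
    merged.update(query_options or {})
    return merged

def scan(lines, options):
    """Parse rows of integers, handling failures per the onBadRecord setting."""
    rows, warnings, errors = [], 0, []
    for position, line in enumerate(lines):
        try:
            rows.append([int(v) for v in line.split(options["delimiter"])])
        except ValueError as err:
            mode = options["onBadRecord"]
            if mode == "warn":
                warnings += 1                         # count the drop
            elif mode == "recordPositionAndError":
                errors.append((position, str(err)))   # keep provenance
            elif mode != "ignore":
                raise                                 # unknown mode: fail the scan

    return rows, warnings, errors

data = ["1|2", "x|3", "4|5"]
# Everyday use: the .drill default (warn) applies.
default_run = scan(data, effective_options(DOT_DRILL_DEFAULTS))
# One-off investigation: the query overrides the .drill default.
debug_run = scan(
    data,
    effective_options(DOT_DRILL_DEFAULTS, {"onBadRecord": "recordPositionAndError"}),
)
```

Both runs return the same good rows. The first only counts the dropped record, while the second also reports its position and the parse error.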

My main thought on Hsuan's initial proposal is it seems to try to provide
an incomplete resolution of #4 above. It isn't clear to me that use case #4
is a critical use case for most users. If it is, can we get some concrete
examples of it as opposed to use cases 1-3? If it is a critical use case, I
think we should solve it in a more general way (for example I don't think
we should try to maintain file-based record provenance in that context).
Among other things, the current proposal has the weird problem of not being
consistent in how the user experiences the behavior (depending on what plan
Drill decides to execute.)
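
One way to picture the plan dependence mentioned above is the Python sketch below. It is an assumption about how such an inconsistency could arise, not a description of Drill's planner: the two functions are hypothetical stand-ins for two plans of the same query (select the cast value where the key is "a"), and they differ only in whether the cast runs before or after the filter.

```python
def safe_cast(value):
    """Return (int, None) on success or (None, error message) on failure."""
    try:
        return int(value), None
    except ValueError as err:
        return None, str(err)

def project_then_filter(rows, wanted):
    """Plan 1: the cast runs on every row, then the filter is applied."""
    out, skipped = [], []
    for key, raw in rows:
        number, error = safe_cast(raw)
        if error is not None:
            skipped.append((key, raw))
            continue
        if key == wanted:
            out.append(number)
    return out, skipped

def filter_then_project(rows, wanted):
    """Plan 2: the filter runs first, so filtered-out rows never reach the cast."""
    out, skipped = [], []
    for key, raw in rows:
        if key != wanted:
            continue
        number, error = safe_cast(raw)
        if error is not None:
            skipped.append((key, raw))
            continue
        out.append(number)
    return out, skipped

ROWS = [("a", "1"), ("b", "oops"), ("a", "3")]
plan_one = project_then_filter(ROWS, "a")
plan_two = filter_then_project(ROWS, "a")
```

Both plans return the same query result, but the first reports one skipped record and the second reports none, so what the user is told about bad records depends on the plan chosen.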

Note, there were some questions about how 1-3 could be solved using B so
I've provided an example in the Jira:
https://issues.apache.org/jira/browse/DRILL-3572

--
Jacques Nadeau
CTO and Co-Founder, Dremio


Re: Request for more feedback on "Support the Ability to Identify And Skip Records" design

Posted by Zelaine Fong <zf...@maprtech.com>.

My understanding of Jacques' proposal is that he suggests we use .drill
instead of requiring the user to do an explicit cast in their select
query. That way, the changes for this enhancement would be restricted to
the scanner.

Did I interpret the alternative approach correctly?

-- Zelaine


Re: Request for more feedback on "Support the Ability to Identify And Skip Records" design

Posted by Hsuan Yi Chu <hy...@maprtech.com>.

Hi,

Luckily, we will have a hangout tomorrow.

Maybe we could have an example to illustrate how .drill can be used in a
cast query?

Thanks.



Re: Request for more feedback on "Support the Ability to Identify And Skip Records" design

Posted by Neeraja Rentachintala <nr...@maprtech.com>.

Jacques,
I have responded to one of your comments on the doc.
Can you please review and comment? I am not clear on the approach you are
suggesting using .drill and what it would mean for the user experience. It
would be great if you could add an example.

As in the other thread (initiated by Julien) about being able to
provide file parsing hints from the query itself for self-service data
exploration, we need this feature to be fairly lightweight from a
user experience point of view. That is, if I as a business user get hold of
some external data and want to take a look by running ad hoc queries on
Drill, I should be able to do it without having to go through the whole
setup of .drill etc., which will come later as the data is 'operationalized'.

thanks
-Neeraja
