You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by Aman Sinha <as...@maprtech.com> on 2015/09/01 18:08:42 UTC

Identifying the source of problematic records

Drill can point out the filename and location of corrupted records in a
file but we don't have a good mechanism to deal with the following
scenario:

Consider a text file with 2 records:
$ cat t4.csv
10,2001
11,http://www.cnn.com

0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;

0: jdbc:drill:zk=local> select cast(columns[0] as init), cast(columns[1] as
bigint) from dfs.`/Users/asinha/data/t4.csv`;

Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com

Fragment 0:0

[Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]

  (java.lang.NumberFormatException) http://www.cnn.com
    org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91

org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
    org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
    org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172

The problem is user does not have a clue about the original source of this
error.  This is a pain point especially when dealing with thousands of
files.

1.  We can start by providing the column index where the problem occurred.
2.  Can a scan batch keep track of the file it originated from ? Since the
Project in the
     above query is pushed right above the scan, it could get the filename
from the record
     batch (assuming we can store this piece of information).  This won't
be possible
     for other Projects elsewhere in the plan.
3.  What about the location within the file ?   Unless the projection is
pushed into the scan
     itself, I don't see a good way to provide this information.

A related topic is how to tell Drill to ignore such records when doing a
query or a CTAS ?
That could be a separate discussion.

Thoughts ?
Aman

Re: Identifying the source of problematic records

Posted by Aman Sinha <as...@maprtech.com>.

@Jinfeng,
  > select dir0, file_name, columns[1] where columns[1]  = '
http://www.cnn.com';
  > "(suppose we extend the dir* and include file_name as well)"

Getting the file_name is where the details come in.   Supporting the select
file_name  as a special column is more work than
what I was initially proposing.

@Jason,
yes, propagating the file name past joins and other operators seems too
much overhead.  In my email,  item #2 I was only
intending this for the Project that has been pushed right above the Scan.
This addresses a vast majority of use cases.

Aman




On Wed, Sep 2, 2015 at 1:39 PM, Jason Altekruse <al...@gmail.com>
wrote:

> While I initially wanted to say I agree with Jinfeng, I think the user
> experience should be better than having to essentially run the query twice.
> It doesn't seem crazy idea for Drill to carry around a source table for a
> batch of records (maybe even an offset/line number in the scan if it is
> known), and to continue to propagate it as long as the data has not been
> modified. It would get a little hairier when we look at the case of join,
> where most of the columns would be simple copies of the incoming data. If
> an expression is evaluated against one of these columns downstream, which
> could fail with one of these poor error messages, we should still be able
> to propagate the source table information through the join, but
> unfortunately that means we would have to store a list of source tables and
> a map to particular columns with each batch.
>
> This would not be a trivial task, but this might be a common enough pain
> point that it would be worth at least trying to come up with a design and
> work estimate for a decent solution.
>
> On Tue, Sep 1, 2015 at 12:14 PM, Jinfeng Ni <ji...@gmail.com> wrote:
>
> > It seems hard to output the filename, # of records, unless the cast
> happens
> > exactly at Scan operator. Otherwise, the input of Project could be any
> > operator, including Scan. It's hard to track the source of record in the
> > chain of operators.
> >
> > I checked how Postgres reported the error in a similar case. Suppose I
> > have two columns :
> >
> > empno   : integer
> > ename   : character varying(20)
> >
> >
> > mydb=# select empno, ename from emp;
> >  empno |   ename
> > -------+------------
> >    100 | John Jones
> >    200 | 200
> > (2 rows)
> >
> > mydb=# select empno, cast(ename as int) from emp;
> > ERROR:  invalid input syntax for integer: "John Jones"
> >
> > Given the casting error, postgres did not report where the error record
> is.
> > Use has to use a query like "select * from emp where ename = 'John Jones"
> > to figure out which record.
> >
> > In Drill, one probably could use similar approach to locate the error
> > record.
> >
> > select dir0, file_name, columns[1] where columns[1]  = '
> http://www.cnn.com
> > ';
> >
> >  (suppose we extend the dir* and include file_name as well).
> >
> >
> >
> >
> > On Tue, Sep 1, 2015 at 10:46 AM, Hsuan Yi Chu <hy...@maprtech.com>
> wrote:
> >
> > > Is it possible to let record batch from scan to know "file name" and
> "the
> > > range of line numbers in this batch"?
> > >
> > > The second one sounds difficult ?
> > >
> > > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <as...@maprtech.com>
> wrote:
> > >
> > > > Drill can point out the filename and location of corrupted records
> in a
> > > > file but we don't have a good mechanism to deal with the following
> > > > scenario:
> > > >
> > > > Consider a text file with 2 records:
> > > > $ cat t4.csv
> > > > 10,2001
> > > > 11,http://www.cnn.com
> > > >
> > > > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` =
> true;
> > > >
> > > > 0: jdbc:drill:zk=local> select cast(columns[0] as init),
> > cast(columns[1]
> > > as
> > > > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> > > >
> > > > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > > >
> > > > Fragment 0:0
> > > >
> > > > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on
> 10.250.56.140:31010
> > ]
> > > >
> > > >   (java.lang.NumberFormatException) http://www.cnn.com
> > > >
>  org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > > >
> > > >
> > >
> >
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > > >
> > >  org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > > >
> > > >
> > >
> >
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > > >
> > > > The problem is user does not have a clue about the original source of
> > > this
> > > > error.  This is a pain point especially when dealing with thousands
> of
> > > > files.
> > > >
> > > > 1.  We can start by providing the column index where the problem
> > > occurred.
> > > > 2.  Can a scan batch keep track of the file it originated from ?
> Since
> > > the
> > > > Project in the
> > > >      above query is pushed right above the scan, it could get the
> > > filename
> > > > from the record
> > > >      batch (assuming we can store this piece of information).  This
> > won't
> > > > be possible
> > > >      for other Projects elsewhere in the plan.
> > > > 3.  What about the location within the file ?   Unless the projection
> > is
> > > > pushed into the scan
> > > >      itself, I don't see a good way to provide this information.
> > > >
> > > > A related topic is how to tell Drill to ignore such records when
> doing
> > a
> > > > query or a CTAS ?
> > > > That could be a separate discussion.
> > > >
> > > > Thoughts ?
> > > > Aman
> > > >
> > >
> >
>

Re: Identifying the source of problematic records

Posted by Jason Altekruse <al...@gmail.com>.

While I initially wanted to say I agree with Jinfeng, I think the user
experience should be better than having to essentially run the query twice.
It doesn't seem crazy idea for Drill to carry around a source table for a
batch of records (maybe even an offset/line number in the scan if it is
known), and to continue to propagate it as long as the data has not been
modified. It would get a little hairier when we look at the case of join,
where most of the columns would be simple copies of the incoming data. If
an expression is evaluated against one of these columns downstream, which
could fail with one of these poor error messages, we should still be able
to propagate the source table information through the join, but
unfortunately that means we would have to store a list of source tables and
a map to particular columns with each batch.

This would not be a trivial task, but this might be a common enough pain
point that it would be worth at least trying to come up with a design and
work estimate for a decent solution.

On Tue, Sep 1, 2015 at 12:14 PM, Jinfeng Ni <ji...@gmail.com> wrote:

> It seems hard to output the filename, # of records, unless the cast happens
> exactly at Scan operator. Otherwise, the input of Project could be any
> operator, including Scan. It's hard to track the source of record in the
> chain of operators.
>
> I checked how Postgres reported the error in a similar case. Suppose I
> have two columns :
>
> empno   : integer
> ename   : character varying(20)
>
>
> mydb=# select empno, ename from emp;
>  empno |   ename
> -------+------------
>    100 | John Jones
>    200 | 200
> (2 rows)
>
> mydb=# select empno, cast(ename as int) from emp;
> ERROR:  invalid input syntax for integer: "John Jones"
>
> Given the casting error, postgres did not report where the error record is.
> Use has to use a query like "select * from emp where ename = 'John Jones"
> to figure out which record.
>
> In Drill, one probably could use similar approach to locate the error
> record.
>
> select dir0, file_name, columns[1] where columns[1]  = 'http://www.cnn.com
> ';
>
>  (suppose we extend the dir* and include file_name as well).
>
>
>
>
> On Tue, Sep 1, 2015 at 10:46 AM, Hsuan Yi Chu <hy...@maprtech.com> wrote:
>
> > Is it possible to let record batch from scan to know "file name" and "the
> > range of line numbers in this batch"?
> >
> > The second one sounds difficult ?
> >
> > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <as...@maprtech.com> wrote:
> >
> > > Drill can point out the filename and location of corrupted records in a
> > > file but we don't have a good mechanism to deal with the following
> > > scenario:
> > >
> > > Consider a text file with 2 records:
> > > $ cat t4.csv
> > > 10,2001
> > > 11,http://www.cnn.com
> > >
> > > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> > >
> > > 0: jdbc:drill:zk=local> select cast(columns[0] as init),
> cast(columns[1]
> > as
> > > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> > >
> > > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010
> ]
> > >
> > >   (java.lang.NumberFormatException) http://www.cnn.com
> > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > >
> > >
> >
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > >
> >  org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > >
> > >
> >
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > >
> > > The problem is user does not have a clue about the original source of
> > this
> > > error.  This is a pain point especially when dealing with thousands of
> > > files.
> > >
> > > 1.  We can start by providing the column index where the problem
> > occurred.
> > > 2.  Can a scan batch keep track of the file it originated from ? Since
> > the
> > > Project in the
> > >      above query is pushed right above the scan, it could get the
> > filename
> > > from the record
> > >      batch (assuming we can store this piece of information).  This
> won't
> > > be possible
> > >      for other Projects elsewhere in the plan.
> > > 3.  What about the location within the file ?   Unless the projection
> is
> > > pushed into the scan
> > >      itself, I don't see a good way to provide this information.
> > >
> > > A related topic is how to tell Drill to ignore such records when doing
> a
> > > query or a CTAS ?
> > > That could be a separate discussion.
> > >
> > > Thoughts ?
> > > Aman
> > >
> >
>

Re: Identifying the source of problematic records

Posted by Jinfeng Ni <ji...@gmail.com>.

It seems hard to output the filename, # of records, unless the cast happens
exactly at Scan operator. Otherwise, the input of Project could be any
operator, including Scan. It's hard to track the source of record in the
chain of operators.

I checked how Postgres reported the error in a similar case. Suppose I
have two columns :

empno   : integer
ename   : character varying(20)


mydb=# select empno, ename from emp;
 empno |   ename
-------+------------
   100 | John Jones
   200 | 200
(2 rows)

mydb=# select empno, cast(ename as int) from emp;
ERROR:  invalid input syntax for integer: "John Jones"

Given the casting error, postgres did not report where the error record is.
Use has to use a query like "select * from emp where ename = 'John Jones"
to figure out which record.

In Drill, one probably could use similar approach to locate the error
record.

select dir0, file_name, columns[1] where columns[1]  = 'http://www.cnn.com';

 (suppose we extend the dir* and include file_name as well).




On Tue, Sep 1, 2015 at 10:46 AM, Hsuan Yi Chu <hy...@maprtech.com> wrote:

> Is it possible to let record batch from scan to know "file name" and "the
> range of line numbers in this batch"?
>
> The second one sounds difficult ?
>
> On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <as...@maprtech.com> wrote:
>
> > Drill can point out the filename and location of corrupted records in a
> > file but we don't have a good mechanism to deal with the following
> > scenario:
> >
> > Consider a text file with 2 records:
> > $ cat t4.csv
> > 10,2001
> > 11,http://www.cnn.com
> >
> > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> >
> > 0: jdbc:drill:zk=local> select cast(columns[0] as init), cast(columns[1]
> as
> > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> >
> > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> >
> > Fragment 0:0
> >
> > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> >
> >   (java.lang.NumberFormatException) http://www.cnn.com
> >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> >
> >
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> >
>  org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> >
> >
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> >
> > The problem is user does not have a clue about the original source of
> this
> > error.  This is a pain point especially when dealing with thousands of
> > files.
> >
> > 1.  We can start by providing the column index where the problem
> occurred.
> > 2.  Can a scan batch keep track of the file it originated from ? Since
> the
> > Project in the
> >      above query is pushed right above the scan, it could get the
> filename
> > from the record
> >      batch (assuming we can store this piece of information).  This won't
> > be possible
> >      for other Projects elsewhere in the plan.
> > 3.  What about the location within the file ?   Unless the projection is
> > pushed into the scan
> >      itself, I don't see a good way to provide this information.
> >
> > A related topic is how to tell Drill to ignore such records when doing a
> > query or a CTAS ?
> > That could be a separate discussion.
> >
> > Thoughts ?
> > Aman
> >
>

Re: Identifying the source of problematic records

Posted by Hsuan Yi Chu <hy...@maprtech.com>.

Is it possible to let record batch from scan to know "file name" and "the
range of line numbers in this batch"?

The second one sounds difficult ?

On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <as...@maprtech.com> wrote:

> Drill can point out the filename and location of corrupted records in a
> file but we don't have a good mechanism to deal with the following
> scenario:
>
> Consider a text file with 2 records:
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
>
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
>
> 0: jdbc:drill:zk=local> select cast(columns[0] as init), cast(columns[1] as
> bigint) from dfs.`/Users/asinha/data/t4.csv`;
>
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
>
> Fragment 0:0
>
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
>
>   (java.lang.NumberFormatException) http://www.cnn.com
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
>
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
>
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
>
> The problem is user does not have a clue about the original source of this
> error.  This is a pain point especially when dealing with thousands of
> files.
>
> 1.  We can start by providing the column index where the problem occurred.
> 2.  Can a scan batch keep track of the file it originated from ? Since the
> Project in the
>      above query is pushed right above the scan, it could get the filename
> from the record
>      batch (assuming we can store this piece of information).  This won't
> be possible
>      for other Projects elsewhere in the plan.
> 3.  What about the location within the file ?   Unless the projection is
> pushed into the scan
>      itself, I don't see a good way to provide this information.
>
> A related topic is how to tell Drill to ignore such records when doing a
> query or a CTAS ?
> That could be a separate discussion.
>
> Thoughts ?
> Aman
>

Re: Identifying the source of problematic records

Posted by Jason Altekruse <al...@gmail.com>.

I was thinking we would just put a catch around the calls to evaluate the
generated code and re-evaluate each individual expression with the
interpreter to find out which one caused the exception.

Thinking about it a little more, the call to the generated code actually
happens inside of the loop in the ProjectTemplate/FilterTemplate classes
today. This is where the information about the index in the current batch
is known, but the list of expressions is not known at this level. We might
have to add an interface to extract the last index we tried to evaluate
from the Template, so that we could use this to evaluate against the
correct row back in the RecordBatch where we have access to expressions
which can be used to materialize the interpreter.

On Thu, Sep 3, 2015 at 6:31 PM, Jacques Nadeau <ja...@dremio.com> wrote:

> Interesting idea.  The question I have is how would this work when you have
> a combination of generated code related to expressions and code not related
> to expressions.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Sep 3, 2015 at 11:31 AM, Jason Altekruse <altekrusejason@gmail.com
> >
> wrote:
>
> > @Jacques,
> >
> > On your point a) about expressing failures and the compilation model, I
> had
> > thought about previously using the interpreter to figure out which
> > expression against the current row failed, once we have caught an
> exception
> > out of some part of the complete code-generated expression evaluation. Do
> > you think this would possibly address your concern? Do you think anything
> > more than the problematic input data and the expression that failed would
> > be produced by the functions in this new standardized error format?
> >
> > - Jason
> >
> > On Wed, Sep 2, 2015 at 8:43 PM, Jacques Nadeau <ja...@dremio.com>
> wrote:
> >
> > > I'd like to propose a few things to solve this:
> > >
> > > a) Functions should be able to express failures in a standardized way.
> > I'm
> > > thinking a new type of injectable and/or a certain type of exception
> > > (although more dangerous/possibly requires rewrite given compilation
> > > model).
> > > b) Users (session/system level) should be able to set a setting where
> > > function errors are handled a certain way. Options could include query
> > > failure, ignore + inform as warning/notice, and save records for later
> > > analysis (maybe in v2).
> > > c) Readers that have a notorious problem (e.g. Text) should support
> > > projection/expression pushdown so that they can create these kinds of
> > > errors and provide additional context as part of that.
> > > d) We should also implement dot drill files so that users can prescribe
> > > this projection/data validation process by default for files/diretories
> > > (which would provide the behavior as c above.
> > > e) We should get more serious about providing useful virtual fields.
> > This
> > > should include filename (similar to directory name).
> > >
> > > Once a record leaves an operator, I don't think we should carry any
> > > additional provenance with it. It would be too heavy weight as a
> default
> > > behavior.
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <as...@maprtech.com>
> wrote:
> > >
> > > > Drill can point out the filename and location of corrupted records
> in a
> > > > file but we don't have a good mechanism to deal with the following
> > > > scenario:
> > > >
> > > > Consider a text file with 2 records:
> > > > $ cat t4.csv
> > > > 10,2001
> > > > 11,http://www.cnn.com
> > > >
> > > > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` =
> true;
> > > >
> > > > 0: jdbc:drill:zk=local> select cast(columns[0] as init),
> > cast(columns[1]
> > > as
> > > > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> > > >
> > > > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > > >
> > > > Fragment 0:0
> > > >
> > > > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on
> 10.250.56.140:31010
> > ]
> > > >
> > > >   (java.lang.NumberFormatException) http://www.cnn.com
> > > >
>  org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > > >
> > > >
> > >
> >
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > > >
> > >  org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > > >
> > > >
> > >
> >
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > > >
> > > > The problem is user does not have a clue about the original source of
> > > this
> > > > error.  This is a pain point especially when dealing with thousands
> of
> > > > files.
> > > >
> > > > 1.  We can start by providing the column index where the problem
> > > occurred.
> > > > 2.  Can a scan batch keep track of the file it originated from ?
> Since
> > > the
> > > > Project in the
> > > >      above query is pushed right above the scan, it could get the
> > > filename
> > > > from the record
> > > >      batch (assuming we can store this piece of information).  This
> > won't
> > > > be possible
> > > >      for other Projects elsewhere in the plan.
> > > > 3.  What about the location within the file ?   Unless the projection
> > is
> > > > pushed into the scan
> > > >      itself, I don't see a good way to provide this information.
> > > >
> > > > A related topic is how to tell Drill to ignore such records when
> doing
> > a
> > > > query or a CTAS ?
> > > > That could be a separate discussion.
> > > >
> > > > Thoughts ?
> > > > Aman
> > > >
> > >
> >
>

Re: Identifying the source of problematic records

Posted by Jacques Nadeau <ja...@dremio.com>.

Interesting idea.  The question I have is how would this work when you have
a combination of generated code related to expressions and code not related
to expressions.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Sep 3, 2015 at 11:31 AM, Jason Altekruse <al...@gmail.com>
wrote:

> @Jacques,
>
> On your point a) about expressing failures and the compilation model, I had
> thought about previously using the interpreter to figure out which
> expression against the current row failed, once we have caught an exception
> out of some part of the complete code-generated expression evaluation. Do
> you think this would possibly address your concern? Do you think anything
> more than the problematic input data and the expression that failed would
> be produced by the functions in this new standardized error format?
>
> - Jason
>
> On Wed, Sep 2, 2015 at 8:43 PM, Jacques Nadeau <ja...@dremio.com> wrote:
>
> > I'd like to propose a few things to solve this:
> >
> > a) Functions should be able to express failures in a standardized way.
> I'm
> > thinking a new type of injectable and/or a certain type of exception
> > (although more dangerous/possibly requires rewrite given compilation
> > model).
> > b) Users (session/system level) should be able to set a setting where
> > function errors are handled a certain way. Options could include query
> > failure, ignore + inform as warning/notice, and save records for later
> > analysis (maybe in v2).
> > c) Readers that have a notorious problem (e.g. Text) should support
> > projection/expression pushdown so that they can create these kinds of
> > errors and provide additional context as part of that.
> > d) We should also implement dot drill files so that users can prescribe
> > this projection/data validation process by default for files/diretories
> > (which would provide the behavior as c above.
> > e) We should get more serious about providing useful virtual fields.
> This
> > should include filename (similar to directory name).
> >
> > Once a record leaves an operator, I don't think we should carry any
> > additional provenance with it. It would be too heavy weight as a default
> > behavior.
> >
> >
> >
> >
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <as...@maprtech.com> wrote:
> >
> > > Drill can point out the filename and location of corrupted records in a
> > > file but we don't have a good mechanism to deal with the following
> > > scenario:
> > >
> > > Consider a text file with 2 records:
> > > $ cat t4.csv
> > > 10,2001
> > > 11,http://www.cnn.com
> > >
> > > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> > >
> > > 0: jdbc:drill:zk=local> select cast(columns[0] as init),
> cast(columns[1]
> > as
> > > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> > >
> > > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010
> ]
> > >
> > >   (java.lang.NumberFormatException) http://www.cnn.com
> > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > >
> > >
> >
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > >
> >  org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > >
> > >
> >
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > >
> > > The problem is user does not have a clue about the original source of
> > this
> > > error.  This is a pain point especially when dealing with thousands of
> > > files.
> > >
> > > 1.  We can start by providing the column index where the problem
> > occurred.
> > > 2.  Can a scan batch keep track of the file it originated from ? Since
> > the
> > > Project in the
> > >      above query is pushed right above the scan, it could get the
> > filename
> > > from the record
> > >      batch (assuming we can store this piece of information).  This
> won't
> > > be possible
> > >      for other Projects elsewhere in the plan.
> > > 3.  What about the location within the file ?   Unless the projection
> is
> > > pushed into the scan
> > >      itself, I don't see a good way to provide this information.
> > >
> > > A related topic is how to tell Drill to ignore such records when doing
> a
> > > query or a CTAS ?
> > > That could be a separate discussion.
> > >
> > > Thoughts ?
> > > Aman
> > >
> >
>

Re: Identifying the source of problematic records

Posted by Jason Altekruse <al...@gmail.com>.

@Jacques,

On your point a) about expressing failures and the compilation model, I had
thought about previously using the interpreter to figure out which
expression against the current row failed, once we have caught an exception
out of some part of the complete code-generated expression evaluation. Do
you think this would possibly address your concern? Do you think anything
more than the problematic input data and the expression that failed would
be produced by the functions in this new standardized error format?

- Jason

On Wed, Sep 2, 2015 at 8:43 PM, Jacques Nadeau <ja...@dremio.com> wrote:

> I'd like to propose a few things to solve this:
>
> a) Functions should be able to express failures in a standardized way. I'm
> thinking a new type of injectable and/or a certain type of exception
> (although more dangerous/possibly requires rewrite given compilation
> model).
> b) Users (session/system level) should be able to set a setting where
> function errors are handled a certain way. Options could include query
> failure, ignore + inform as warning/notice, and save records for later
> analysis (maybe in v2).
> c) Readers that have a notorious problem (e.g. Text) should support
> projection/expression pushdown so that they can create these kinds of
> errors and provide additional context as part of that.
> d) We should also implement dot drill files so that users can prescribe
> this projection/data validation process by default for files/diretories
> (which would provide the behavior as c above.
> e) We should get more serious about providing useful virtual fields.  This
> should include filename (similar to directory name).
>
> Once a record leaves an operator, I don't think we should carry any
> additional provenance with it. It would be too heavy weight as a default
> behavior.
>
>
>
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <as...@maprtech.com> wrote:
>
> > Drill can point out the filename and location of corrupted records in a
> > file but we don't have a good mechanism to deal with the following
> > scenario:
> >
> > Consider a text file with 2 records:
> > $ cat t4.csv
> > 10,2001
> > 11,http://www.cnn.com
> >
> > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> >
> > 0: jdbc:drill:zk=local> select cast(columns[0] as init), cast(columns[1]
> as
> > bigint) from dfs.`/Users/asinha/data/t4.csv`;
> >
> > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> >
> > Fragment 0:0
> >
> > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> >
> >   (java.lang.NumberFormatException) http://www.cnn.com
> >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> >
> >
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> >
>  org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> >
> >
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> >
> > The problem is user does not have a clue about the original source of
> this
> > error.  This is a pain point especially when dealing with thousands of
> > files.
> >
> > 1.  We can start by providing the column index where the problem
> occurred.
> > 2.  Can a scan batch keep track of the file it originated from ? Since
> the
> > Project in the
> >      above query is pushed right above the scan, it could get the
> filename
> > from the record
> >      batch (assuming we can store this piece of information).  This won't
> > be possible
> >      for other Projects elsewhere in the plan.
> > 3.  What about the location within the file ?   Unless the projection is
> > pushed into the scan
> >      itself, I don't see a good way to provide this information.
> >
> > A related topic is how to tell Drill to ignore such records when doing a
> > query or a CTAS ?
> > That could be a separate discussion.
> >
> > Thoughts ?
> > Aman
> >
>

Re: Identifying the source of problematic records

Posted by Jacques Nadeau <ja...@dremio.com>.

I'd like to propose a few things to solve this:

a) Functions should be able to express failures in a standardized way. I'm
thinking a new type of injectable and/or a certain type of exception
(although more dangerous/possibly requires rewrite given compilation
model).
b) Users (session/system level) should be able to set a setting where
function errors are handled a certain way. Options could include query
failure, ignore + inform as warning/notice, and save records for later
analysis (maybe in v2).
c) Readers that have a notorious problem (e.g. Text) should support
projection/expression pushdown so that they can create these kinds of
errors and provide additional context as part of that.
d) We should also implement dot drill files so that users can prescribe
this projection/data validation process by default for files/diretories
(which would provide the behavior as c above.
e) We should get more serious about providing useful virtual fields.  This
should include filename (similar to directory name).

Once a record leaves an operator, I don't think we should carry any
additional provenance with it. It would be too heavy weight as a default
behavior.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <as...@maprtech.com> wrote:

> Drill can point out the filename and location of corrupted records in a
> file but we don't have a good mechanism to deal with the following
> scenario:
>
> Consider a text file with 2 records:
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
>
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
>
> 0: jdbc:drill:zk=local> select cast(columns[0] as init), cast(columns[1] as
> bigint) from dfs.`/Users/asinha/data/t4.csv`;
>
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
>
> Fragment 0:0
>
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
>
>   (java.lang.NumberFormatException) http://www.cnn.com
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
>
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
>
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
>
> The problem is user does not have a clue about the original source of this
> error.  This is a pain point especially when dealing with thousands of
> files.
>
> 1.  We can start by providing the column index where the problem occurred.
> 2.  Can a scan batch keep track of the file it originated from ? Since the
> Project in the
>      above query is pushed right above the scan, it could get the filename
> from the record
>      batch (assuming we can store this piece of information).  This won't
> be possible
>      for other Projects elsewhere in the plan.
> 3.  What about the location within the file ?   Unless the projection is
> pushed into the scan
>      itself, I don't see a good way to provide this information.
>
> A related topic is how to tell Drill to ignore such records when doing a
> query or a CTAS ?
> That could be a separate discussion.
>
> Thoughts ?
> Aman
>