Posted to issues@drill.apache.org by "Sean Hsuan-Yi Chu (JIRA)" <ji...@apache.org> on 2015/10/09 01:08:27 UTC

[jira] [Commented] (DRILL-3764) Support the ability to identify and/or skip records when a function evaluation fails

    [ https://issues.apache.org/jira/browse/DRILL-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949575#comment-14949575 ] 

Sean Hsuan-Yi Chu commented on DRILL-3764:
------------------------------------------

The option might not be viable if the data type is non-nullable. 

Further, we cannot simply cast the column to a nullable data type, because batches prior to the current one might already have been sent to the downstream operator. Changing the type to nullable at that point would cause SchemaChange issues.
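
To make the concern concrete, here is a toy model in Python (not Drill code). The names Field, Downstream, SchemaChangeError and cast_batch are all hypothetical, and the only assumption carried over from the comment is that a downstream operator has already received batches with a non-nullable column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical column descriptor: name, type and nullability."""
    name: str
    type: str
    nullable: bool


class SchemaChangeError(Exception):
    """Toy stand-in for a schema change seen by a downstream operator."""


class Downstream:
    """Toy downstream operator that locks onto the schema of its first batch."""

    def __init__(self):
        self.schema = None
        self.rows = []

    def consume(self, schema, rows):
        if self.schema is None:
            self.schema = schema
        elif schema != self.schema:
            raise SchemaChangeError(f"{self.schema} -> {schema}")
        self.rows.extend(rows)


def cast_batch(values, nullable):
    """Cast strings to int; emit None on failure only if the column is nullable."""
    out = []
    for v in values:
        try:
            out.append(int(v))
        except ValueError:
            if not nullable:
                raise
            out.append(None)
    return out


required = (Field("c1", "BIGINT", nullable=False),)
optional = (Field("c1", "BIGINT", nullable=True),)

op = Downstream()
# The first batch casts cleanly and goes downstream as non-nullable.
op.consume(required, cast_batch(["2001"], nullable=False))

# The second batch hits a bad value. Switching to a nullable column here
# changes the schema the downstream operator has already seen.
schema_changed = False
try:
    op.consume(optional, cast_batch(["http://www.cnn.com"], nullable=True))
except SchemaChangeError:
    schema_changed = True
```

In this sketch the second batch is rejected by the downstream operator rather than by the cast itself, which is the situation described above.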

> Support the ability to identify and/or skip records when a function evaluation fails
> ------------------------------------------------------------------------------------
>
>                 Key: DRILL-3764
>                 URL: https://issues.apache.org/jira/browse/DRILL-3764
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Functions - Drill
>    Affects Versions: 1.1.0
>            Reporter: Aman Sinha
>             Fix For: Future
>
>
> Drill can point out the filename and location of corrupted records in a file but it does not have a good mechanism to deal with the following scenario: 
> Consider a text file with 2 records:
> {code}
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
> {code}
> {code}
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint) from dfs.`t4.csv`;
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> Fragment 0:0
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
>   (java.lang.NumberFormatException) http://www.cnn.com
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> {code}
> The problem is that the user does not have the context of where the error occurred: neither the file name nor the record number. This becomes a pain point especially when CTAS is being used to do data conversion from (say) text format to Parquet format. The CTAS may be accessing thousands of files, and a single such casting (or other function) failure aborts the query.
> It would substantially improve the user experience if we provided: 
> 1) the filename and record number where this failure occurred
> 2) the ability to skip such records depending on a session option
> 3) the ability to write such records to a staging table for future ingestion
> Please see discussion on dev list: 
> http://mail-archives.apache.org/mod_mbox/drill-dev/201509.mbox/%3cCAFyDVvLuPLgTNZ56S6=J=9Vb=aBs=pDw7NRHKkdUPbdxGFAdcg@mail.gmail.com%3e
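
The three items requested in the quoted description can be sketched outside Drill as follows. This is a minimal Python illustration, not a proposed implementation: the function convert, the skip_bad_records flag (standing in for a session option) and the rejected list (standing in for a staging table) are all hypothetical names.

```python
import csv
import io


def convert(files, skip_bad_records=False):
    """Cast both columns of every record to int.

    files maps a file name to its text content. Failures either abort with
    the file name and record number (item 1), or, when skip_bad_records is
    set (item 2), are collected in a staging list for later ingestion (item 3).
    """
    good, rejected = [], []
    for filename, text in files.items():
        reader = csv.reader(io.StringIO(text))
        for record_number, columns in enumerate(reader, start=1):
            try:
                good.append((int(columns[0]), int(columns[1])))
            except ValueError as e:
                if not skip_bad_records:
                    raise ValueError(
                        f"{filename}, record {record_number}: {e}"
                    ) from e
                rejected.append((filename, record_number, columns))
    return good, rejected


# Same two records as t4.csv in the description.
files = {"t4.csv": "10,2001\n11,http://www.cnn.com\n"}

good, rejected = convert(files, skip_bad_records=True)
# good     -> [(10, 2001)]
# rejected -> [("t4.csv", 2, ["11", "http://www.cnn.com"])]
```

With skip_bad_records left at False, the same input raises an error whose message starts with "t4.csv, record 2:", so the user can locate the offending record.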


