Posted to issues@drill.apache.org by "benj (Jira)" <ji...@apache.org> on 2020/02/20 09:30:00 UTC

[jira] [Commented] (DRILL-7588) Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row

    [ https://issues.apache.org/jira/browse/DRILL-7588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040793#comment-17040793 ] 

benj commented on DRILL-7588:
-----------------------------

Another possible workaround:

For a Windows file with \r\n line endings, using '\n' as the line delimiter avoids the problem described above.
 In that case, however, the last field of each row ends with a trailing \r. If the last field is known, this is easy to handle with REGEXP_REPLACE(last_field,'\r$','').
 Still, this is not really satisfactory.
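
A minimal sketch of this workaround, adapted from the query in the issue description. The index columns[1] for the last field and the alias last_field are assumptions for illustration only; the actual position of the last field depends on the file.

{code:sql}
-- Read the file with '\n' as the line delimiter, then strip the trailing '\r'
-- left at the end of the last field of each row.
SELECT
  columns[0] AS d
 ,REGEXP_REPLACE(columns[1], '\r$', '') AS last_field  -- columns[1]: hypothetical index of the last field
FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\n'))
{code}

If the file has only one column, the same REGEXP_REPLACE would have to be applied to columns[0] before passing it to to_timestamp.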

> Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row
> --------------------------------------------------------------------------------
>
>                 Key: DRILL-7588
>                 URL: https://issues.apache.org/jira/browse/DRILL-7588
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.17.0
>            Reporter: benj
>            Priority: Major
>         Attachments: demo.tsv.gz, drill_json_profile_tsv.log, drill_tsv.log
>
>
> Given a TSV file (attached as [#demo.tsv.gz]) generated on _Windows_ (EOL = \r\n).
>  The file contains some special characters, such as
> {noformat}
> http://bouzbal-fans.blogspot.com/search/label/Ã\230£Ã\230®Ã\230¨Ã\230§Ã\230± Ã\230¨Ã\231Ë\206Ã\230²Ã\230¨Ã\230§Ã\231â\200\236
> {noformat}
> The following query sometimes eats the first character of a line:
> {code:sql}
> --CREATE TABLE dfs.test.`result_pqt` AS (
> SELECT 
>   columns[0] as d
>  ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) 
> FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\r\n'))
> --)
> java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid format: "/19/2015 9:33:39 AM"
> {code}
> The string "/19/2015 9:33:39 AM" does not exist at the start of any line: the month is present in this field in the TSV (the file demo.tsv actually contains "3/19/2015 9:33:39 AM").
> If '\r\n' is replaced by '\n' with _sed_ before running the query, the result is correct with *lineDelimiter => '\r\n'*, with *lineDelimiter => '\n'*, and without the TABLE function (there is no error, the date is correctly converted by to_timestamp, and column d is correct in result_pqt).
> Keeping '\r\n' and moving the line that produces the error to another position in demo.tsv can prevent the error (why?).
> Keeping '\r\n' and removing or modifying one or more special characters (like in "thá»\235i trang jean") can prevent the error (why?).
> I did not manage to reduce demo.tsv any further while still reproducing the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)