Posted to issues@drill.apache.org by "benj (Jira)" <ji...@apache.org> on 2020/02/14 15:27:00 UTC
[jira] [Updated] (DRILL-7588) Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row

     [ https://issues.apache.org/jira/browse/DRILL-7588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

benj updated DRILL-7588:
------------------------
    Description: 
The problem occurs with a TSV file ([#demo.tsv.gz], attached) generated on _Windows_ (EOL = \r\n).
 The file contains some special characters, such as:
{noformat}
http://bouzbal-fans.blogspot.com/search/label/Ø£Ø®Ø¨Ø§Ø± Ø¨ÙˆØ²Ø¨Ø§Ù„
{noformat}
The following query sometimes eats the first character of a line:
{code:sql}
--CREATE TABLE dfs.test.`result_pqt` AS (
SELECT 
  columns[0] as d
 ,CAST(to_timestamp(columns[0],'MM/dd/yy HH:mm:ss a') AS TIMESTAMP) 
FROM TABLE(dfs.test.`demo.tsv` (type => 'text', extractHeader => false, fieldDelimiter => '\t', lineDelimiter => '\r\n'))
--)
java.sql.SQLException: SYSTEM ERROR: IllegalArgumentException: Invalid format: "/19/2015 9:33:39 AM"
{code}
No field in the file starts with "/19/2015 9:33:39 AM" (nothing matches "^/19/2015 9:33:39 AM"). The month is present in this field in the TSV: the actual value in demo.tsv is "3/19/2015 9:33:39 AM".
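
The claim above can be checked outside Drill. The following is a minimal sketch in Python, not part of the original report: the helper name `lines_with_prefix` and the two sample rows are hypothetical, and the real check would read the bytes of demo.tsv instead of the inline sample.

```python
from typing import List


def lines_with_prefix(data: bytes, prefix: bytes, eol: bytes = b"\r\n") -> List[int]:
    """Return the 1-based numbers of the lines in `data` that start with `prefix`.

    Lines are split on `eol` (CRLF by default, as in a file written on Windows).
    """
    return [
        number
        for number, line in enumerate(data.split(eol), start=1)
        if line.startswith(prefix)
    ]


# Hypothetical two-row sample standing in for demo.tsv (tab-separated, CRLF line endings).
sample = b"3/19/2015 9:33:39 AM\tfoo\r\n3/20/2015 10:01:02 PM\tbar\r\n"

# No line starts with "/", so a value like "/19/2015 ..." cannot come from the raw data.
no_month = lines_with_prefix(sample, b"/")

# The first line does start with the full date, month included.
with_month = lines_with_prefix(sample, b"3/19/2015")
```

If the first list is also empty for the real file, the truncated value must be produced while the file is read, not by the data itself.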

If '\r\n' is replaced by '\n' with _sed_ before running the query, the result is correct with *lineDelimiter => '\r\n'*, with *lineDelimiter => '\n'*, and without the TABLE function: there is no error, the date is correctly converted by to_timestamp, and column d is correct in result_pqt.
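
For readers without _sed_, the same line-ending normalisation can be sketched in Python. This is an illustration only: the exact _sed_ command used is not given in this report, and the names `crlf_to_lf` and `normalise_file` are hypothetical.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF; a lone CR is left untouched."""
    return data.replace(b"\r\n", b"\n")


def normalise_file(src_path: str, dst_path: str) -> None:
    """Write a copy of `src_path` to `dst_path` with CRLF line endings converted to LF.

    The file is handled as raw bytes so that multi-byte characters are not re-encoded.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(crlf_to_lf(data))


# Hypothetical two-row sample with Windows line endings.
converted = crlf_to_lf(b"3/19/2015 9:33:39 AM\tfoo\r\n3/20/2015 10:01:02 PM\tbar\r\n")
```

Working on bytes, not decoded text, keeps the special characters in the file byte-for-byte identical, so only the line endings differ between the failing and the working input.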

Keeping '\r\n' but moving the line that produces the error to another position in demo.tsv can make the error disappear (why?).
Keeping '\r\n' but removing or modifying one or more special characters (as in "thời trang jean") can also make the error disappear (why?).

I did not manage to reduce demo.tsv any further while still reproducing the problem.

    Environment:     (was: the same text that now appears in the Description above)

> Function TABLE + option lineDelimiter = '\r\n' eats sometime first char of a row
> --------------------------------------------------------------------------------
>
>                 Key: DRILL-7588
>                 URL: https://issues.apache.org/jira/browse/DRILL-7588
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.17.0
>            Reporter: benj
>            Priority: Major
>         Attachments: demo.tsv.gz, drill_json_profile_tsv.log, drill_tsv.log
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)