Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/07/03 18:06:00 UTC

[jira] [Commented] (DRILL-5239) Drill text reader reports wrong results when column value starts with '#'

    [ https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072784#comment-16072784 ] 

Paul Rogers commented on DRILL-5239:
------------------------------------

Note that the '#' symbol is used in some formats to indicate a comment:

{code}
# Exported from server abcd at 2017:07:01T01:00:00
# Server log version 2.3
time,recv-ip,bytes,status,...
<data>
{code}

Since some formats allow this, one solution might be to add an option (off by default) that permits comment lines ahead of the header. If the option is off, then '#' is just another character. If it is on, then we skip comment lines until we find the header. In neither case do we need to allow comment lines in the data section.
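To make the proposed rule concrete, here is a minimal sketch in Python. It is not Drill's text reader; the function name `read_delimited` and the parameters `allow_header_comments` and `comment_char` are hypothetical, chosen only for illustration. The sketch also assumes the file has a header line.

```python
from typing import Iterable, List, Optional, Tuple


def read_delimited(
    lines: Iterable[str],
    delimiter: str = ",",
    allow_header_comments: bool = False,  # hypothetical option, off by default
    comment_char: str = "#",
) -> Tuple[Optional[List[str]], List[List[str]]]:
    """Split lines into (header, rows).

    Comment lines are skipped only before the header, and only when
    allow_header_comments is True. In the data section the comment
    character is always treated as ordinary data.
    """
    it = iter(lines)
    header: Optional[List[str]] = None

    # Header phase: optionally skip leading comment lines.
    for raw in it:
        line = raw.rstrip("\r\n")
        if allow_header_comments and line.startswith(comment_char):
            continue
        header = line.split(delimiter)
        break

    # Data phase: no comment handling at all.
    rows = [raw.rstrip("\r\n").split(delimiter) for raw in it]
    return header, rows


if __name__ == "__main__":
    log = [
        "# Exported from server abcd\n",
        "# Server log version 2.3\n",
        "time,recv-ip,bytes,status\n",
        "#,98,0,200\n",
    ]
    print(read_delimited(log, allow_header_comments=True))
```

With the option on, the two leading '#' lines are dropped and the row beginning with '#' survives as data. With the option off, the first '#' line is simply taken as the header, since '#' is just another character.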

There is probably a write-up of this somewhere for a format that allows comments. Perhaps we can track that down as a reference. (I saw the format in conjunction with web logs a few jobs back...)

> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not, however, occur with a Parquet file.


