Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/07/04 20:24:00 UTC

[jira] [Comment Edited] (DRILL-5239) Drill text reader reports wrong results when column value starts with '#'

    [ https://issues.apache.org/jira/browse/DRILL-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074056#comment-16074056 ] 

Paul Rogers edited comment on DRILL-5239 at 7/4/17 8:23 PM:
------------------------------------------------------------

See [RFC-4180|https://tools.ietf.org/html/rfc4180] for the IETF standard for CSV files. This standard allows an optional header line but says nothing about comments. Drill's "compliant" text reader complies with RFC-4180.

As noted in [Wikipedia|https://en.wikipedia.org/wiki/Comma-separated_values], despite RFC-4180, CSV is not a well-defined format; it is more an informal theme, with each system handling the details slightly differently. Indeed, Drill itself violates RFC-4180 in that we allow Linux-style line terminators instead of the DOS-style (CRLF) terminators the RFC specifies.
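To make the terminator point concrete, here is a minimal sketch (plain Python, not Drill's actual reader) showing that splitting on universal newlines tolerates both the RFC's CRLF and bare LF. (This naive split ignores quoting entirely; a real parser must handle quoted fields first.)

```python
dos_data = "a,b\r\n1,2\r\n"   # RFC-4180 style: CRLF terminators
unix_data = "a,b\n1,2\n"      # Linux style: bare LF terminators

def rows(text):
    """Split CSV text into rows, tolerating CRLF or LF line terminators.
    Naive: splits fields on ',' with no quote handling."""
    return [line.split(",") for line in text.splitlines() if line]

# Both terminator styles parse to the same rows.
assert rows(dos_data) == rows(unix_data) == [["a", "b"], ["1", "2"]]
```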

I also came across [Super CSV|https://super-csv.github.io/super-csv/csv_specification.html], which summarizes the rules in a more readable format than the RFC.

Someone has specified a [JSON format|http://specs.frictionlessdata.io/csv-dialect/] to describe CSV file formats.

So we can see that, in normal practice, comments are not part of the RFC-4180 format. Your suggestion to have them turned off by default is therefore a good one.

Let's dig into comments a bit. Here is our [first hint|https://stackoverflow.com/questions/1961006/can-a-csv-file-have-a-comment], third answer down:

{quote}
In engineering data, it is common to see the # symbol in the first column used to signal a comment.

I use the [ostermiller CSV parsing library|http://ostermiller.org/utils/CSV.html] to read and process such files. That library allows you to set the comment character. After the parse operation you get an array just containing the real data, no comments.
{quote}

The [Ostermiller Java Utilities, Comma Separated Values (CSV)|http://ostermiller.org/utils/CSV.html] library looks to be a good source for what people actually do in practice; the comments in the description suggest the author kept refining the library to handle the various file formats "in the wild." Of particular interest is the difference between what he calls the "Unix style" and "Excel style" of CSV files. I don't think our code handles these differences. The bottom of the page lists a number of other CSV resources.

Maybe we should just use the Ostermiller code rather than tinkering with the existing code? Especially since, as part of the "memory fragmentation" project, I've already ripped out and replaced the header parser and will need to replace the two implementations that write to value vectors...

We've talked about comments. Another issue is blank lines. The [Ostermiller CSV lexer|http://ostermiller.org/utils/src/CSVLexer.lex.html] says it ignores blank lines. Perhaps we should also, since a blank line can never be a valid record (unless the header states that the file has only one field... You gotta love CSV...)

That same page shows how the lexer handles comments:

{code}
shredder.setCommentStart("#;!");
{code}

In this case, a comment character starts the comment (which, this being a lexer, can appear anywhere in the line). If the resulting line is blank, it is ignored. Perhaps overkill...
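The lexer's behavior can be sketched as follows (plain Python, illustrative only; note this naive version does not respect quoting, so a quoted value such as {{"#fred"}} would be mangled, whereas the real lexer processes quotes first):

```python
def strip_comments(lines, comment_chars="#;!"):
    """Ostermiller-lexer-style filtering: any configured comment character,
    anywhere in the line, starts a comment; lines that are blank after
    stripping are dropped. Does NOT handle quoted fields."""
    out = []
    for line in lines:
        for ch in comment_chars:
            pos = line.find(ch)
            if pos >= 0:
                line = line[:pos]  # truncate at the comment character
        if line.strip():           # drop lines that became (or were) blank
            out.append(line)
    return out

data = ["a,b,c", "# full-line comment", "1,2,3 ! trailing comment", "   "]
assert strip_comments(data) == ["a,b,c", "1,2,3 "]
```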

A [very complete "informal" specification|https://www.csvreader.com/csv_format.php] of CSV files in the wild says that a comment, if present, must start at the first character of the line. (Lots of good stuff in this link!) Here is the golden nugget (and where I saw the comment character on a previous project): "An IIS web log file is a good example." A Google search found [an example|http://surfray.com/?id=blog&news=38133] similar to the one I made up in a previous post.

{code}
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2009-06-11 05:12:03
#Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status
2009-06-11 05:12:02 W3SVC1893743816 192.168.1.109 GET / – 4677 – 192.168.1.109 Mozilla/4.0 (compatible;+MSIE+4.01;+Windows+NT;+MS+Search+5.0+Robot) 401 2 2148074254 2009-06-11 05:12:02 W3SVC1893743816 192.168.1.109 GET / – 4677 – 192.168.1.109 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+NT;+MS+Search+5.0+Robot) 401 2 2148074254
{code}

Note, however, that the above format is not really CSV. Field names appear in a comment, not in a CSV header. Values are space-delimited (not comma-delimited), though I've also seen tab-delimited variations. Reading the above properly requires a specialized parser. Still, one can get close with an SSV (space-separated value) parser that ignores comments. Drill would read all fields into one big array, which is awful because different servers can be configured with different fields. We'll leave that issue for another time.
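For illustration, such a specialized parser is not hard to sketch: it would use the {{#Fields:}} comment as the header and split data rows on spaces (the sample above shows IIS substituting '+' for spaces inside values such as the user-agent, so a plain space split mostly works). A minimal Python sketch, not Drill code:

```python
def parse_iis_log(text):
    """Sketch of a space-separated reader for IIS/W3C-style logs: the
    '#Fields:' comment supplies column names; other '#' lines (Software,
    Version, Date) and blanks are skipped. Assumes values contain no
    literal spaces (IIS encodes them, e.g. as '+')."""
    fields, records = [], []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()   # header from comment
        elif line.startswith("#") or not line.strip():
            continue                                  # other comments, blanks
        else:
            records.append(dict(zip(fields, line.split())))
    return records

log = ("#Version: 1.0\n"
       "#Fields: date time cs-method sc-status\n"
       "2009-06-11 05:12:02 GET 401\n")
recs = parse_iis_log(log)
assert recs[0]["cs-method"] == "GET" and recs[0]["sc-status"] == "401"
```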

Drill's CSV parser is based on the [uniVocity CSV parser|http://docs.univocity.com/parsers/2.4.1/com/univocity/parsers/csv/CsvParser.html], now maintained on [GitHub|https://github.com/uniVocity/univocity-parsers]. A good question is whether we can just use the jar available from the GitHub site rather than copying the code as we have done.

Go down to the [Examples|https://github.com/uniVocity/univocity-parsers#reading-csv] section. Voila! There is a CSV file in all its glory with blank lines and comments!

So, let us set our goal to read this [example file|http://github.com/uniVocity/univocity-parsers/tree/master/src/test/resources/examples/example.csv]. We need a set of options that allow us to do so. This means:

* File has headers (true/*false*)
* Read headers (true/*false*)
* Skip blank lines (*true*/false)
* Comment character (any single character, blank by default meaning no comment)
* Unix extensions (true/*false*)
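The option set above could be modeled as a simple config record. The names below are illustrative only, not Drill's actual text-format plugin fields; defaults follow the bolded values in the list:

```python
from dataclasses import dataclass

@dataclass
class TextFormatOptions:
    """Hypothetical option set mirroring the list above."""
    has_headers: bool = False       # file has headers (default false)
    read_headers: bool = False      # read headers into column names
    skip_blank_lines: bool = True   # a blank line can never be a valid record
    comment_char: str = ""          # empty means comments disabled
    unix_extensions: bool = False   # backslash escapes such as '\#'

# Enable '#' comments, everything else at defaults.
opts = TextFormatOptions(comment_char="#")
assert opts.skip_blank_lines and not opts.has_headers
```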

If comments are enabled, a line may have leading whitespace before the comment character. And, in a file with comments, a value can still begin with the comment character if it is quoted (Excel style) or escaped (Unix style):

{code}
# A file with comment
col1, col2, col3
# Next line is a comment
#fred,10,bedrock
# So is this next one
    #fred,10,bedrock
# Next line is not a comment (Excel style)
"#fred",10,bedrock
# Next line is not a comment (Unix style)
\#fred,10,bedrock
# Not a comment, leading space in value (i.e. "  #fred"), Unix style
   \#fred,10,bedrock
{code}
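The classification rules in the example above can be sketched directly (plain Python, illustrative only; field splitting and full quote handling are left to a real parser):

```python
def classify(line, comment_char="#", unix_extensions=False):
    """Classify a line per the rules above: leading whitespace may precede
    the comment character; a quoted value ('"#fred"') never looks like a
    comment because it starts with '"'; with Unix extensions, '\\#' escapes
    the comment character, so the backslash is dropped and the line is data.
    Returns ('comment', None), ('blank', None), or ('data', value_line)."""
    stripped = line.lstrip()
    if not stripped:
        return ("blank", None)
    if comment_char and stripped.startswith(comment_char):
        return ("comment", None)
    if unix_extensions and stripped.startswith("\\" + comment_char):
        # Unescape '\#' -> '#', preserving any leading whitespace in the value.
        return ("data", line.replace("\\" + comment_char, comment_char, 1))
    return ("data", line)

assert classify("   #fred,10,bedrock")[0] == "comment"       # leading space, still a comment
assert classify('"#fred",10,bedrock')[0] == "data"           # Excel-style quoting
assert classify("\\#fred,10,bedrock", unix_extensions=True) == ("data", "#fred,10,bedrock")
```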

The lesson in all this is that CSV files have a long history and exist in the field with many variations. Mature parsers have learned to handle these variations. Because Drill must handle data as it actually exists, not as we'd like it to be, Drill would be well served to learn from those who came before us and to correctly support the variations encoded in these various CSV libraries. Fortunately for us, if we just use the (newer? complete?) uniVocity libraries, we get that for free.



> Drill text reader reports wrong results when column value starts with '#'
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5239
>                 URL: https://issues.apache.org/jira/browse/DRILL-5239
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Roman
>            Priority: Blocker
>              Labels: doc-impacting
>
> git.commit.id.abbrev=2af709f
> Data Set :
> {code}
> D|32
> 8h|234
> ;#|3489
> ^$*(|308
> #|98
> {code}
> Wrong Result : (Last row is missing)
> {code}
> select columns[0] as col1, columns[1] as col2 from dfs.`/drill/testdata/wtf2.tbl`;
> +-------+-------+
> | col1  | col2  |
> +-------+-------+
> | D     | 32    |
> | 8h    | 234   |
> | ;#    | 3489  |
> | ^$*(  | 308   |
> +-------+-------+
> 4 rows selected (0.233 seconds)
> {code}
> The issue does not however happen with a parquet file



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)