Posted to issues@commons.apache.org by "Lars Bruun-Hansen (Jira)" <ji...@apache.org> on 2019/10/29 15:45:00 UTC
[jira] [Comment Edited] (CSV-253) Handle absent values in input (null)
[ https://issues.apache.org/jira/browse/CSV-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962105#comment-16962105 ]
Lars Bruun-Hansen edited comment on CSV-253 at 10/29/19 3:44 PM:
-----------------------------------------------------------------
[~ggregory] Sorry, but the whole point of PR-51 is that {{nullString}} cannot handle the issue at hand. The {{nullString}} feature serves a different purpose, so something else is required.
Example:
The aim is to parse the following CSV:
{noformat}
"John",,""{noformat}
What happens when using the {{nullString}} feature to tackle the problem is summarized below:
||Setting||element1||element2||element3||
|<expected result>|"John"|null|""|
|with nullString = null|"John"|""|""|
|with nullString = ""|"John"|null|null|
As can be seen, no {{nullString}} setting achieves the desired result. This is essentially because Apache CSV currently has no concept of what I call an _absent value_. To the Lexer, element2 and element3 have the same value. They don't!
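To make the table concrete, here is a minimal sketch that only models the behaviour described above. It is not Commons CSV code: {{NullStringSketch}}, {{LEXED}} and {{applyNullString}} are hypothetical names, and the sketch simply assumes that the lexer has already reduced both the absent field and the quoted empty field to an empty string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class NullStringSketch {

    // Assumption taken from the table above: after lexing, the absent field
    // and the quoted empty field of "John",,"" are both the empty string.
    static final List<String> LEXED = Arrays.asList("John", "", "");

    // Models a nullString-style mapping: every value equal to nullString
    // becomes null. Passing null for nullString disables the mapping.
    public static List<String> applyNullString(List<String> values, String nullString) {
        return values.stream()
                .map(v -> nullString != null && nullString.equals(v) ? null : v)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(applyNullString(LEXED, null)); // elements 2 and 3 both stay ""
        System.out.println(applyNullString(LEXED, ""));   // elements 2 and 3 both become null
    }
}
```

Whichever value is chosen, element2 and element3 always end up identical, because the mapping only ever sees two equal strings.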
With the PR the parser becomes aware of the difference between element2 and element3.
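The idea can be illustrated with a small standalone sketch. This is not the code from the PR and not the Commons CSV Lexer; {{AbsentValueSketch}} and {{parseLine}} are hypothetical names, and the sketch assumes a comma delimiter, double-quote quoting and a single-line record. The only point is that remembering whether a field contained a quote is enough to tell an absent value from an empty one.

```java
import java.util.ArrayList;
import java.util.List;

final class AbsentValueSketch {

    // Splits one CSV line into fields. When absentIsNull is true, a field with
    // zero characters between delimiters becomes null, while a quoted empty
    // field ("") stays an empty string.
    public static List<String> parseLine(String line, boolean absentIsNull) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean sawQuote = false; // true once the current field has contained a quote
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
                sawQuote = true;
            } else if (c == ',') {
                fields.add(finish(current, sawQuote, absentIsNull));
                current.setLength(0);
                sawQuote = false;
            } else {
                current.append(c);
            }
        }
        fields.add(finish(current, sawQuote, absentIsNull));
        return fields;
    }

    // A field is absent only if it has no characters and was never quoted.
    private static String finish(StringBuilder current, boolean sawQuote, boolean absentIsNull) {
        boolean absent = current.length() == 0 && !sawQuote;
        return (absent && absentIsNull) ? null : current.toString();
    }

    public static void main(String[] args) {
        System.out.println(parseLine("\"John\",,\"\"", true));
        System.out.println(parseLine("\"John\",,\"\"", false));
    }
}
```

With the flag on, the example line yields "John", null and an empty string, which is the expected result from the table. With the flag off, the absent field is returned as an empty string, as it is today.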
You can also see [this question|https://stackoverflow.com/questions/34734125/apache-common-csvparser-csvrecord-to-return-null-for-empty-fields] on SO. One of the answers criticizes the Apache CSV library for not being able to handle this situation. Unfortunately, that criticism is correct.
h3. Why two settings?
Of course there is a certain conceptual overlap between the proposed new setting on Formatter, {{absentIsNull}}, and the existing {{nullString}}, and if the library were designed again from scratch they could probably be merged. But we have the history, and the way {{nullString}} works cannot be touched because that would break backwards compatibility. I also believe 99.9% of the library's users would actually want an absent value parsed as null, but I don't dare propose that as a new default, again because it would break backwards compatibility. Hence I propose a new setting on Formatter, and I propose it as an opt-in feature.
> Handle absent values in input (null)
> ------------------------------------
>
> Key: CSV-253
> URL: https://issues.apache.org/jira/browse/CSV-253
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Reporter: Lars Bruun-Hansen
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The parser must be able to handle absent values in input and translate that into {{null}} as required. I see several tickets on this matter in the history, but none seem to have addressed the issue, at least not for parsing.
> For this problem, I see a need to introduce a new term:
> Definition: _Absent value_ is when there are zero characters between field delimiters.
> Specifically the aim is to be able to parse the following:
> {noformat}
> "John",,"Doe" // 2nd element is absent
> ,"AA",123 // 1st element is absent
> "John",90, // 3rd element is absent
> "",,90 // 2nd element is absent (1st element isn't)
> {noformat}
>
> See also CSV-93 which I think never addressed the issue, probably because the reporter was happy with having the issue fixed for CSV output, not for parsing.
> A PR is coming...