Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2022/10/28 10:52:00 UTC
[jira] [Commented] (CSV-264) Duplicate empty header names are allowed even with `.withAllowDuplicateHeaderNames(false)`

    [ https://issues.apache.org/jira/browse/CSV-264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625598#comment-17625598 ] 

Alex Herbert commented on CSV-264:
----------------------------------

Some inconsistencies in how the CSVFormat and the CSVParser handle duplicate headers have been corrected in git master. The CSVParser uses more settings to control validation. Validation is less strict in the CSVFormat because its settings are used for both parsing and writing data.
h2. CSVParser

The CSVFormat has a flag {{allowMissingColumnNames}}. This is only used by the CSVParser. If the flag is false then any column header that is null, the empty string "", or a whitespace-only (blank) string is identified as "missing" and an exception is raised. Thus missing column names have been defined as any null or blank string in previous release versions. Note that in the CSVParser this behaviour occurs irrespective of the DuplicateHeaderMode. Activating DuplicateHeaderMode.ALLOW_EMPTY for parsing requires setting {{allowMissingColumnNames=true}}.

If the duplicate header mode is DISALLOW then any duplicate header will raise an exception. In the case of "missing" headers all variations of "missing" are considered the same. Thus an exception can be raised for a header containing both null and "", as these are considered the same header for the purpose of duplicate checks.

If the duplicate header mode is ALLOW_EMPTY then any missing header is allowed, including duplicates. This allows whitespace headers to be used to pad columns for formatting purposes. A side effect is a possible many-to-one mapping in the CSVParser header map: multiple empty headers using the same string (e.g. "") will map to the last column in the input with this empty header. This is noted in the method javadoc for CSVParser.getHeaderMap().
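
To illustrate the many-to-one side effect, here is a minimal sketch that models a header map with plain JDK collections. The class {{HeaderMapSketch}} and the method {{buildHeaderMap}} are hypothetical names used for illustration; this is not the Commons CSV implementation, only a model of the "last column wins" behaviour described above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the parser's header map; not Commons CSV code.
class HeaderMapSketch {

    // Maps each header name to its column index.
    // A repeated name overwrites the index stored for the earlier column.
    static Map<String, Integer> buildHeaderMap(List<String> headers) {
        Map<String, Integer> map = new LinkedHashMap<>();
        for (int i = 0; i < headers.size(); i++) {
            map.put(headers.get(i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        // Header row from the issue description: "","a",""
        Map<String, Integer> map = buildHeaderMap(List.of("", "a", ""));
        System.out.println(map.size());  // 2: three columns, two distinct names
        System.out.println(map.get("")); // 2: the last empty-named column wins
    }
}
```

With the header row from the issue description, a lookup by "" can only reach the third column; the first column is not reachable by name.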

The CSVFormat has a flag {{ignoreHeaderCase}}. This is only used by the CSVParser. A header of ["A", "a"] with {{ignoreHeaderCase=true}} and {{duplicateHeaderMode=DISALLOW}} will raise an exception in the CSVParser.
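
The parser rules in this section can be modelled roughly as follows. This is a sketch of the rules as stated in this comment, written against the JDK only; the class {{ParserHeaderCheckSketch}}, its nested {{Mode}} enum and the methods {{check}} and {{rejects}} are hypothetical and are not the actual CSVParser code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of the CSVParser header checks described above.
class ParserHeaderCheckSketch {

    // Local stand-in for DuplicateHeaderMode.
    enum Mode { ALLOW_ALL, ALLOW_EMPTY, DISALLOW }

    // "Missing" means null, empty, or whitespace only.
    static boolean isMissing(String header) {
        return header == null || header.trim().isEmpty();
    }

    // Throws IllegalArgumentException if the header row violates the settings.
    static void check(List<String> headers, Mode mode,
                      boolean allowMissingColumnNames, boolean ignoreHeaderCase) {
        Set<String> seen = new HashSet<>();
        boolean seenMissing = false;
        for (String header : headers) {
            boolean missing = isMissing(header);
            // The missing-name check applies irrespective of the mode.
            if (missing && !allowMissingColumnNames) {
                throw new IllegalArgumentException("Missing column name");
            }
            boolean duplicate;
            if (missing) {
                // All variations of "missing" count as the same header.
                duplicate = seenMissing;
                seenMissing = true;
            } else {
                String key = ignoreHeaderCase ? header.toLowerCase(Locale.ROOT) : header;
                duplicate = !seen.add(key);
            }
            boolean allowed = mode == Mode.ALLOW_ALL
                    || (missing && mode == Mode.ALLOW_EMPTY);
            if (duplicate && !allowed) {
                throw new IllegalArgumentException("Duplicate header: " + header);
            }
        }
    }

    // Convenience wrapper: true if check(...) throws.
    static boolean rejects(List<String> headers, Mode mode,
                           boolean allowMissingColumnNames, boolean ignoreHeaderCase) {
        try {
            check(headers, mode, allowMissingColumnNames, ignoreHeaderCase);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // null and "" are the same header for duplicate checks.
        System.out.println(rejects(Arrays.asList(null, "a", ""), Mode.DISALLOW, true, false));   // true
        // Blank padding columns pass with ALLOW_EMPTY when missing names are allowed.
        System.out.println(rejects(Arrays.asList("", "a", " "), Mode.ALLOW_EMPTY, true, false)); // false
        // "A" and "a" collide only when header case is ignored.
        System.out.println(rejects(Arrays.asList("A", "a"), Mode.DISALLOW, true, true));         // true
    }
}
```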
h2. CSVFormat

The CSVFormat validation is less strict than the CSVParser. It does not use the values for the flags {{allowMissingColumnNames}} or {{ignoreHeaderCase}}. The validation behaviour of CSVFormat matches that of CSVParser when {{allowMissingColumnNames=true}} and {{ignoreHeaderCase=false}}.

A header of ["A", "a"] with {{ignoreHeaderCase=true}} and {{duplicateHeaderMode=DISALLOW}} will not raise an exception in the CSVFormat.
h2. Behavioural Compatibility

The {{DuplicateHeaderMode}} enum replaces the flag {{allowDuplicateHeaders}}. For behavioural compatibility {{allowDuplicateHeaders=false}} will still allow duplicate empty headers:
||allowDuplicateHeaders (deprecated)||DuplicateHeaderMode||
|true|ALLOW_ALL|
|false|ALLOW_EMPTY|
| |DISALLOW|
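
The table can be read as a two-way mapping from the deprecated boolean to the enum, with DISALLOW only reachable by setting the mode directly. A minimal sketch, using a local stand-in enum and a hypothetical helper {{fromAllowDuplicateHeaders}} (neither is taken from the library source):

```java
// Hypothetical illustration of the compatibility table; not Commons CSV code.
class LegacyFlagSketch {

    // Local stand-in mirroring the enum constants named in the table.
    enum DuplicateHeaderMode { ALLOW_ALL, ALLOW_EMPTY, DISALLOW }

    // The deprecated boolean can only select ALLOW_ALL or ALLOW_EMPTY.
    static DuplicateHeaderMode fromAllowDuplicateHeaders(boolean allowDuplicateHeaders) {
        return allowDuplicateHeaders
                ? DuplicateHeaderMode.ALLOW_ALL
                : DuplicateHeaderMode.ALLOW_EMPTY;
    }

    public static void main(String[] args) {
        System.out.println(fromAllowDuplicateHeaders(true));  // ALLOW_ALL
        System.out.println(fromAllowDuplicateHeaders(false)); // ALLOW_EMPTY
    }
}
```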

This behaviour of allowing duplicate empty headers dates back to CSV v1.0. To control empty headers the flag {{ignoreEmptyHeaders}} was added at the same time (see CSV-121).

Supporting this legacy behaviour means that CSVFormat validation, which does not use the {{ignoreEmptyHeaders}} flag, must use {{DuplicateHeaderMode.DISALLOW}} instead of {{allowDuplicateHeaders=false}} to check for duplicate empty headers. Validation of any missing headers is not currently possible in CSVFormat.
h2. Summary
 * Missing column headers are any of: null; ""; or blank strings.
 * All missing column headers are considered the same for duplicate checks.
 * CSVParser must enable {{allowMissingColumnNames}} to be able to use the {{DuplicateHeaderMode.ALLOW_EMPTY}} behaviour.
 * CSVParser respects the {{ignoreHeaderCase}} flag when checking for duplicates.

All current behaviour is tested in {{CSVDuplicateHeaderTest}}.

 

> Duplicate empty header names are allowed even with `.withAllowDuplicateHeaderNames(false)`
> ------------------------------------------------------------------------------------------
>
>                 Key: CSV-264
>                 URL: https://issues.apache.org/jira/browse/CSV-264
>             Project: Commons CSV
>          Issue Type: Bug
>          Components: Parser
>    Affects Versions: 1.8
>            Reporter: Sagar Tiwari
>            Priority: Major
>             Fix For: 1.10.0
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> I'm trying to parse a csv like this:
>  
> {{CSVFormat.DEFAULT}}
> {{ .withHeader()}}
> {{ .withAllowDuplicateHeaderNames(false)}}
> {{ .withAllowMissingColumnNames()}}
> {{ .parse(InputStreamReader(FileInputStream(fl)))}}
>  
> One would expect this code to throw an error if the following csv is given as input:
>  
>  
> {{"","a",""}}
> {{"1","X","3"}}
> {{"3","Y","4"}}
>  
> But it doesn't, and asking for `record.get("")` gives the value from the second column. The first column is ignored.


