You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@daffodil.apache.org by "Sloane, Brandon" <bs...@tresys.com> on 2019/12/01 16:36:26 UTC

Re: RFC-compliant DFDL schema for parsing and unparsing CSV files

Can we add this to the DFDLSchemas/CSV repository?
________________________________
From: Costello, Roger L. <co...@mitre.org>
Sent: Saturday, November 30, 2019 8:05 AM
To: users@daffodil.apache.org <us...@daffodil.apache.org>
Subject: RFC-compliant DFDL schema for parsing and unparsing CSV files

Hi Folks,

Here is my RFC-compliant DFDL schema for CSV:

http://www.xfront.com/DFDL/DFDL-schema-for-CSV.zip

Here is a description of my DFDL schema:

This DFDL schema describes the CSV file format, as specified in RFC 4180. The RFC says this:

    TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

That means, to be standards-compliant, a CSV file must only contain printable ASCII characters. That seemed a bit limiting so I contacted the editor of the RFC, Yakov Shafranovich, and asked him about that. He kindly responded and said this: The original RFC was set to ASCII  only but when RFC 7111 was published, the media type was updated to use UTF-8. See:
https://lists.w3.org/Archives/Public/public-csv-wg/2014Oct/0115.html
https://www.iana.org/assignments/media-types/text/csv

The first reference says: While RFC 4180 does mandate ASCII, for standards purposes this has been changed and the default now is in fact UTF-8.

Okay, so CSV can contain more than just ASCII characters. Phew!

The second reference says: The "charset" parameter specifies the charset employed by the CSV content.

Okay, so I parameterized the below DFDL schema: when you run a DFDL processor on this schema, feed in a value for the charset parameter. The allowable values are UTF-8 or ASCII (case sensitive!).

I just found another RFC for CSV: RFC 7111. Its introduction says this: This memo updates the text/csv media type defined in RFC 4180 by defining URI fragment identifiers for text/csv MIME entities.

Hmm, I better read that document ... Okay, I read RFC 7111. It doesn't modify the CSV format, except to say that a charset parameter may be used to specify the charset employed by the CSV content. The RFC  describes how to reference portions of a CSV file using fragment identifiers on a URL. That's not relevant to describing the CSV format.

Why did I create a DFDL schema for CSV? Last month I was browsing the Web and came across a web site
(http://www.hexacorn.com/blog/2019/09/06/state-machine-vs-regex/)
that said something very interesting:

        The enlightenment came from reading
         the actual CSV specification. When you
         skim through it you quickly realize two
         things:
        1. No one ever reads stuff like this anymore
        2. It's unlikely that anyone covers all angles
            while saving files in this format
                 The result is that we have many badly
         implemented CSV parsers out there. You also
         realize why: this format is NOT as simple
         as many people think. Quite the opposite,
         even today, after so many years, even
         Excel (which is actually including a lot
         of margin for error!) still fails to
         process some of these files correctly...

After reading that I thought, "Hey, using DFDL I should be able to write a parser that covers all angles of the CSV file format." And I did!

The following DFDL schema precisely describes the CSV data format. Here's a summary of what this DFDL schema expresses:
1. A CVS file consists of a one or more records separated by newlines.
2. The last record may or may not have an ending newline.
3. A record consists of one or more fields, separated by commas (or some other symbol).
4. Spaces are considered part of a field and may not be ignored.
5. A CSV file may or may not have a header. If present, it is the first line. A header consists of one or more names, separated by commas. The header is separated from the records by a newline.
6. A "header" parameter with value "present" means there is a header, "absent" means there is no header.
7. Each record should contain the same number of fields as names in the header, if present. If the header is not present, then each record should contain the same number of fields as the other records.
8. A field may be wrapped in double quotes.
9. Commas in a field that is wrapped in double quotes are ignored, i.e. the commas are not to be treated as field separators.
10. Newlines in a field that is wrapped in double quotes are ignored, i.e. the newlines are not to be treated as record separators.
11. A double quote within a field must be escaped by a double quote.
12. A "charset" parameter with value "UTF-8" means the file may contain any UTF-8 character, "ASCII" means the file may only contain ASCII characters.


Re: RFC-compliant DFDL schema for parsing and unparsing CSV files

Posted by Steve Lawrence <sl...@apache.org>.
Yeah, this looks really nice and should definitely be on the DFDLSchemas
repo, with Rogers permission.

A couple minor comments:

1. The DFDL specification predefines a few variables in the dfdl
namespace [1], one of which is "dfdl:encoding". This has the same
function as your charset variable. It might make sense to use the
predefined one instead, so your dfdl:format tag would look like this:

  <dfdl:format ref="default-dfdl-properties" encoding="{ $dfdl:encoding
}" />

and you wouldn't need to define an extra variable.

2. Your "header" variable is used to determine whether or not to parse
the header element, and its values should be either "present" or
"absent". Since this can really only have two values, it might make
sense to change the variable name to "hasHeader" and make it a boolean
with either "true" or "false" values.

3.  DFDLSchemas isn't part of Apache, so the licensing isn't as strict,
but this should still have a license applied to it. Obviously, I'd
recommend the Apache License, but ultimately it's up to Roger.


[1] https://daffodil.apache.org/docs/dfdl/#_Toc398030690

On 12/1/19 11:36 AM, Sloane, Brandon wrote:
> Can we add this to the DFDLSchemas/CSV repository?
> --------------------------------------------------------------------------------
> *From:* Costello, Roger L. <co...@mitre.org>
> *Sent:* Saturday, November 30, 2019 8:05 AM
> *To:* users@daffodil.apache.org <us...@daffodil.apache.org>
> *Subject:* RFC-compliant DFDL schema for parsing and unparsing CSV files
> Hi Folks,
> 
> Here is my RFC-compliant DFDL schema for CSV:
> 
> http://www.xfront.com/DFDL/DFDL-schema-for-CSV.zip
> 
> Here is a description of my DFDL schema:
> 
> This DFDL schema describes the CSV file format, as specified in RFC 4180. The 
> RFC says this:
> 
>      TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
> 
> That means, to be standards-compliant, a CSV file must only contain printable 
> ASCII characters. That seemed a bit limiting so I contacted the editor of the 
> RFC, Yakov Shafranovich, and asked him about that. He kindly responded and said 
> this: The original RFC was set to ASCII  only but when RFC 7111 was published, 
> the media type was updated to use UTF-8. See:
> https://lists.w3.org/Archives/Public/public-csv-wg/2014Oct/0115.html
> https://www.iana.org/assignments/media-types/text/csv
> 
> The first reference says: While RFC 4180 does mandate ASCII, for standards 
> purposes this has been changed and the default now is in fact UTF-8.
> 
> Okay, so CSV can contain more than just ASCII characters. Phew!
> 
> The second reference says: The "charset" parameter specifies the charset 
> employed by the CSV content.
> 
> Okay, so I parameterized the below DFDL schema: when you run a DFDL processor on 
> this schema, feed in a value for the charset parameter. The allowable values are 
> UTF-8 or ASCII (case sensitive!).
> 
> I just found another RFC for CSV: RFC 7111. Its introduction says this: This 
> memo updates the text/csv media type defined in RFC 4180 by defining URI 
> fragment identifiers for text/csv MIME entities.
> 
> Hmm, I better read that document ... Okay, I read RFC 7111. It doesn't modify 
> the CSV format, except to say that a charset parameter may be used to specify 
> the charset employed by the CSV content. The RFC  describes how to reference 
> portions of a CSV file using fragment identifiers on a URL. That's not relevant 
> to describing the CSV format.
> 
> Why did I create a DFDL schema for CSV? Last month I was browsing the Web and 
> came across a web site
> (http://www.hexacorn.com/blog/2019/09/06/state-machine-vs-regex/)
> that said something very interesting:
> 
>          The enlightenment came from reading
>           the actual CSV specification. When you
>           skim through it you quickly realize two
>           things:
>          1. No one ever reads stuff like this anymore
>          2. It's unlikely that anyone covers all angles
>              while saving files in this format
>                   The result is that we have many badly
>           implemented CSV parsers out there. You also
>           realize why: this format is NOT as simple
>           as many people think. Quite the opposite,
>           even today, after so many years, even
>           Excel (which is actually including a lot
>           of margin for error!) still fails to
>           process some of these files correctly...
> 
> After reading that I thought, "Hey, using DFDL I should be able to write a 
> parser that covers all angles of the CSV file format." And I did!
> 
> The following DFDL schema precisely describes the CSV data format. Here's a 
> summary of what this DFDL schema expresses:
> 1. A CVS file consists of a one or more records separated by newlines.
> 2. The last record may or may not have an ending newline.
> 3. A record consists of one or more fields, separated by commas (or some other 
> symbol).
> 4. Spaces are considered part of a field and may not be ignored.
> 5. A CSV file may or may not have a header. If present, it is the first line. A 
> header consists of one or more names, separated by commas. The header is 
> separated from the records by a newline.
> 6. A "header" parameter with value "present" means there is a header, "absent" 
> means there is no header.
> 7. Each record should contain the same number of fields as names in the header, 
> if present. If the header is not present, then each record should contain the 
> same number of fields as the other records.
> 8. A field may be wrapped in double quotes.
> 9. Commas in a field that is wrapped in double quotes are ignored, i.e. the 
> commas are not to be treated as field separators.
> 10. Newlines in a field that is wrapped in double quotes are ignored, i.e. the 
> newlines are not to be treated as record separators.
> 11. A double quote within a field must be escaped by a double quote.
> 12. A "charset" parameter with value "UTF-8" means the file may contain any 
> UTF-8 character, "ASCII" means the file may only contain ASCII characters.
>