Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 15:06:17 UTC

[GitHub] [beam] damccorm opened a new issue, #17832: Implement a CSV file reader

damccorm opened a new issue, #17832:
URL: https://github.com/apache/beam/issues/17832

   We should implement a CSV-based source.
   
   One possibility would be to support the same options as BigQuery. https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats These options are:
   
   fieldDelimiter: allowing a custom delimiter... csv vs tsv, etc. My guess is this is critical. One common delimiter that people use is 'thorn' (þ).
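   
   As a rough illustration of what a custom delimiter option means, here is a sketch using Python's standard csv module. It is not a Beam API; `parse_delimited` and its `field_delimiter` parameter are hypothetical names chosen to mirror the option above.

```python
import csv
import io


def parse_delimited(text, field_delimiter=","):
    """Parse delimited text into rows (hypothetical helper, not a Beam API).

    field_delimiter must be a single character, as the csv module requires.
    """
    # newline="" leaves line endings untouched so the csv module handles them.
    source = io.StringIO(text, newline="")
    return list(csv.reader(source, delimiter=field_delimiter))


# Tab-separated input.
tsv_rows = parse_delimited("a\tb\tc\n1\t2\t3\n", field_delimiter="\t")

# Thorn-separated input (U+00FE).
thorn_rows = parse_delimited("a\u00feb\u00fec\n", field_delimiter="\u00fe")
```

   The same function covers CSV, TSV and thorn-delimited files; only the delimiter argument changes.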
   
   quote: Custom quote char. By default, this is '"', but this allows users to set it to something else, or, perhaps more commonly, remove it entirely (by setting it to the empty string). For example, tab-separated files generally don't need quotes.
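   
   A sketch of how an empty quote setting could map onto "no quote processing", again using Python's csv module as a stand-in. `parse_with_quote` is a hypothetical helper, and treating the empty string as "disable quoting" is an assumption taken from the description above.

```python
import csv
import io


def parse_with_quote(text, field_delimiter=",", quote='"'):
    """Parse rows with a custom quote character (hypothetical helper).

    An empty quote string disables quote handling entirely.
    """
    source = io.StringIO(text, newline="")
    if quote == "":
        # QUOTE_NONE tells the reader to treat quote characters as plain data.
        reader = csv.reader(source, delimiter=field_delimiter,
                            quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(source, delimiter=field_delimiter,
                            quotechar=quote)
    return list(reader)


# A tab-separated line containing a stray double quote, read with quoting off.
unquoted = parse_with_quote('"unbalanced\tvalue\n', field_delimiter="\t", quote="")

# Single quotes used as the quote character, protecting an embedded comma.
single_quoted = parse_with_quote("'a,b',c\n", quote="'")
```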
   
   allowQuotedNewlines: whether newlines may appear inside quoted fields. RFC 4180 allows this, so a single record can span several lines, e.g. "a","b\n","c". A reader then cannot split a large CSV file at an arbitrary newline, because it cannot tell whether that newline ends a record or sits inside a quoted field. We should disallow quoted newlines by default unless the user really wants them (in which case, they'll get worse performance).
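   
   To make the splitting problem concrete, the sketch below compares naive line splitting (roughly what a reader does when it starts at an arbitrary newline) with quote-aware parsing. It uses Python's csv module purely for illustration; the sample data is made up.

```python
import csv
import io

# Two logical records; the first has a newline inside a quoted field.
data = 'a,"b\nstill b",c\nd,e,f\n'

# Naive approach: treat every newline as a record boundary.
naive_records = data.splitlines()

# Quote-aware approach: the parser keeps the quoted newline inside the field.
parsed_records = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_records), "records when splitting on newlines")
print(len(parsed_records), "records when honoring quotes")
```

   The naive split sees three fragments where there are only two records, which is why a file that may contain quoted newlines has to be read sequentially from a known record boundary.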
   
   allowJaggedRows: If a row has fewer columns than the schema, infer null for the missing trailing values. Otherwise we give an error for the row.
   
   ignoreUnknownValues: The opposite of allowJaggedRows, this means that if a user has _too_ many values for the schema, we will ignore the ones we don't recognize, rather than reporting an error for the row.
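   
   The two row-length options above could be sketched as a single normalization step. This is an illustrative Python helper, not a Beam API; `fit_row_to_schema` and its parameter names are hypothetical, and raising `ValueError` stands in for "report an error for the row".

```python
def fit_row_to_schema(values, num_columns,
                      allow_jagged_rows=False,
                      ignore_unknown_values=False):
    """Pad or trim a parsed row to num_columns (hypothetical helper)."""
    row = list(values)
    if len(row) < num_columns:
        if not allow_jagged_rows:
            raise ValueError(
                f"expected {num_columns} columns, got {len(row)}")
        # Too few values: fill the missing trailing columns with None (null).
        return row + [None] * (num_columns - len(row))
    if len(row) > num_columns:
        if not ignore_unknown_values:
            raise ValueError(
                f"expected {num_columns} columns, got {len(row)}")
        # Too many values: drop the extras beyond the schema.
        return row[:num_columns]
    return row


padded = fit_row_to_schema(["a", "b"], 3, allow_jagged_rows=True)
trimmed = fit_row_to_schema(["a", "b", "c", "d"], 3, ignore_unknown_values=True)
```

   With both flags left at their defaults, any row whose length differs from the schema raises an error.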
   
   skipHeaderRows: How many header lines at the top of the file to skip.
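   
   A minimal sketch of header skipping, assuming the headers are the first N records of the input. `read_rows` and `skip_header_rows` are hypothetical names, and Python's csv module is used only as a stand-in parser.

```python
import csv
import io
import itertools


def read_rows(text, skip_header_rows=0):
    """Return data rows after dropping the first N records (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text, newline=""))
    # islice skips parsed records, so a quoted multi-line header counts once.
    return list(itertools.islice(reader, skip_header_rows, None))


data_rows = read_rows("exported 2022-06-03\nname,age\nalice,30\n",
                      skip_header_rows=2)
```

   Note that in a source that splits a file across workers, only the reader handling the start of the file would need to skip anything.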
   
   encoding: UTF-8 vs. Latin-1, etc.
   compression: gzip, bzip2, etc.
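   
   Encoding and compression compose naturally: decompress the bytes, decode them to text, then parse. The sketch below shows that layering with Python's standard library. `read_csv_file` and the compression names in `_OPENERS` are hypothetical and not a Beam API.

```python
import bz2
import csv
import gzip

# Hypothetical mapping from a compression option to a file opener.
_OPENERS = {
    "none": open,
    "gzip": gzip.open,
    "bzip2": bz2.open,
}


def read_csv_file(path, encoding="utf-8", compression="none"):
    """Read all rows from a possibly compressed CSV file (hypothetical helper)."""
    opener = _OPENERS[compression]
    # Text mode ("rt") makes each opener decode bytes with the given encoding.
    with opener(path, mode="rt", encoding=encoding, newline="") as handle:
        return list(csv.reader(handle))
```

   For example, `read_csv_file("data.csv.gz", encoding="latin-1", compression="gzip")` would read a gzip-compressed Latin-1 file; the file name here is made up.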
   
   Imported from Jira [BEAM-51](https://issues.apache.org/jira/browse/BEAM-51). Original Jira may contain additional context.
   Reported by: dhalperi.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] damccorm closed issue #17832: Implement a CSV file reader

Posted by GitBox <gi...@apache.org>.
damccorm closed issue #17832: Implement a CSV file reader
URL: https://github.com/apache/beam/issues/17832

