Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 18:37:16 UTC

[GitHub] [beam] kennknowles opened a new issue, #18460: TextIO should support reading header lines

kennknowles opened a new issue, #18460:
URL: https://github.com/apache/beam/issues/18460

   Users frequently request the ability to skip some header rows when reading text files.
   
   
   https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
   https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
   https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
   https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
   https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow
   
   This is also relevant for reading file formats such as VCF; see the thread at https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E
   
   Python partially supports this via skip_header_lines (https://github.com/apache/beam/pull/1771/files), but the header lines can have useful content, and the number of header lines is not always fixed (for example, in VCF).
   
   We should figure out a good API for this and support this natively in TextIO. The API decisions would be:
   
   - How do we specify how much of the beginning of each file is the header? Options include a fixed number of lines, lines that start with a certain character, or a custom predicate.
   - How do we make the header contents accessible to a user of TextIO? Since the header can be different in each file, we can't return it as a PCollectionView<List<String>>. Instead, I suppose that when you use a header, you'd need to specify a SerializableFunction<KV<List<String>, String>, T> or something similar for parsing (header, line) -> user type. Note that TextIO.Read currently does not support returning a user type anyway, so that would need to be done too.
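
To make the two decisions above concrete, here is a minimal sketch in plain Python. It is not Beam code and not a proposed TextIO API: the names `first_n_lines`, `starts_with`, and `read_with_header` are hypothetical, and a single file is modeled as an in-memory list of lines. It only illustrates how a header predicate and a (header, line) -> user type parse function could fit together.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Hypothetical predicate type: (line index, line text) -> is this a header line?
HeaderPredicate = Callable[[int, str], bool]


def first_n_lines(n: int) -> HeaderPredicate:
    """Option 1: the header is a fixed number of leading lines."""
    return lambda index, line: index < n


def starts_with(prefix: str) -> HeaderPredicate:
    """Option 2: the header is every leading line that starts with a prefix."""
    return lambda index, line: line.startswith(prefix)


def read_with_header(
    lines: Iterable[str],
    is_header: HeaderPredicate,
    parse: Callable[[List[str], str], T],
) -> Iterator[T]:
    """Collect leading header lines, then parse each remaining line with them.

    Only lines at the start of the file are treated as header; once the
    first non-header line is seen, later matches are ordinary data lines.
    """
    header: List[str] = []
    in_header = True
    for index, line in enumerate(lines):
        if in_header and is_header(index, line):
            header.append(line)
            continue
        in_header = False
        yield parse(header, line)


# Usage: a CSV file with one header row, parsed into dicts keyed by column name.
csv_file = ["name,age", "ada,36", "alan,41"]
rows = list(
    read_with_header(
        csv_file,
        first_n_lines(1),
        lambda header, line: dict(zip(header[0].split(","), line.split(","))),
    )
)
```

Option 3 from the list, a custom predicate, is any other function with the same (index, line) signature. The sketch sidesteps the hard part of a real implementation: when a file is split across workers, each split would still need access to that file's header.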
   
   Imported from Jira [BEAM-2776](https://issues.apache.org/jira/browse/BEAM-2776). Original Jira may contain additional context.
   Reported by: jkff.

