Posted to commits@beam.apache.org by "Eugene Kirpichov (JIRA)" <ji...@apache.org> on 2017/08/17 23:21:00 UTC

[jira] [Created] (BEAM-2776) TextIO should support reading header lines

Eugene Kirpichov created BEAM-2776:
--------------------------------------

             Summary: TextIO should support reading header lines
                 Key: BEAM-2776
                 URL: https://issues.apache.org/jira/browse/BEAM-2776
             Project: Beam
          Issue Type: Bug
          Components: sdk-java-core
            Reporter: Eugene Kirpichov


Users frequently request the ability to skip some header rows when reading text files.


https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow

This is also relevant for reading file formats such as VCF; see the thread at https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E

The Python SDK supports this partially via skip_header_lines (https://github.com/apache/beam/pull/1771/files), but the header lines can have useful content, and the number of header lines is not always fixed (e.g. in VCF).

We should figure out a good API for this and support this natively in TextIO. The API decisions would be:

- How do we specify how much of the beginning of each file is the header? Options could be, e.g., a certain number of lines, or lines that start with a certain character.
- How do we make the header contents accessible to a user of TextIO? Since the header can be different in each file, we can't return it as a PCollectionView<List<String>>. Instead, I suppose that when you use a header, you'd need to specify a SerializableFunction<KV<List<String>, String>, T> or something like that for parsing (header, line) -> user type. Note that TextIO.Read currently does not support returning a user type anyway, so that would need to be done too.
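
To make the two decisions above concrete, here is a rough sketch in plain Java, without any Beam dependency. Everything in it is hypothetical: HeaderAwareReader, read, firstLines and the use of Predicate/BiFunction are illustration only and are not an existing or proposed TextIO API. It only shows the shape of the idea: a per-file rule decides which leading lines are the header, and a (header, line) -> user type function parses each remaining line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names are part of Beam's TextIO.
final class HeaderAwareReader {

  /**
   * Reads the text of a single file. Leading lines accepted by isHeaderLine
   * form the header; every later line is passed to parseFn together with
   * that file's header.
   */
  static <T> List<T> read(
      BufferedReader reader,
      Predicate<String> isHeaderLine,
      BiFunction<List<String>, String, T> parseFn) throws IOException {
    List<String> header = new ArrayList<>();
    List<T> records = new ArrayList<>();
    boolean inHeader = true;
    String line;
    while ((line = reader.readLine()) != null) {
      if (inHeader && isHeaderLine.test(line)) {
        header.add(line);
        continue;
      }
      // The first non-header line ends the header for good, so a later
      // line that happens to look like a header is treated as a record.
      inHeader = false;
      records.add(parseFn.apply(Collections.unmodifiableList(header), line));
    }
    return records;
  }

  /** Header rule "a certain number of lines". Stateful: create one per file. */
  static Predicate<String> firstLines(int n) {
    int[] seen = {0};
    return line -> seen[0]++ < n;
  }

  public static void main(String[] args) throws IOException {
    // VCF-like input: the number of '#' header lines is not fixed.
    String vcf = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\n20\t14370\trs6054257\n";
    List<String> out = read(
        new BufferedReader(new StringReader(vcf)),
        line -> line.startsWith("#"),  // header rule "lines that start with a certain character"
        (header, line) -> header.size() + " header lines; ID=" + line.split("\t")[2]);
    out.forEach(System.out::println);
  }
}
```

In a real TextIO implementation the header would additionally have to be made available to every split of a file, not just the one containing its beginning; this sketch ignores splitting entirely.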



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)