Posted to commits@beam.apache.org by "Eugene Kirpichov (JIRA)" <ji...@apache.org> on 2017/08/17 23:22:00 UTC

[jira] [Updated] (BEAM-2776) TextIO should support reading header lines

     [ https://issues.apache.org/jira/browse/BEAM-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Kirpichov updated BEAM-2776:
-----------------------------------
    Description: 
Users frequently request the ability to skip some header rows when reading text files.


https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow

This is also relevant for reading file formats such as VCF, see thread https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E

Python partially supports this via skip_header_lines (https://github.com/apache/beam/pull/1771/files), but the header lines can contain useful content, and the number of header lines is not fixed (e.g. in VCF).

We should figure out a good API for this and support this natively in TextIO. The API decisions would be:

- How do we specify how much of the beginning of each file is the header? Options include a fixed number of lines, lines that start with a certain character, or a custom predicate.
- How do we make the header contents accessible to a user of TextIO? Since the header can differ between files, we can't return it as a PCollectionView<List<String>>. Instead, I suppose that when you use a header, you'd need to specify a SerializableFunction<KV<List<String>, String>, T> or something like that for parsing (header, line) -> user type. Note that TextIO.Read currently does not support returning a user type anyway, so that would need to be done too.
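
To make the two decisions above concrete, here is a minimal plain-Java sketch of what such an API might look like. It is not Beam code and not a proposed TextIO signature: HeaderAwareReader, HeaderMatcher, firstNLines, startsWith, custom and parse are all hypothetical names invented for illustration, and a java.util.function.BiFunction stands in for the SerializableFunction<KV<List<String>, String>, T> mentioned above. It also assumes one file is already available as a list of lines, which sidesteps how a real source would split files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Predicate;

/** Hypothetical sketch only; none of these names exist in Beam's TextIO. */
public class HeaderAwareReader {

  /** Decides whether the line at a given position still belongs to the header. */
  public interface HeaderMatcher {
    boolean isHeader(int lineIndex, String line);
  }

  /** Option 1: the header is a fixed number of leading lines. */
  public static HeaderMatcher firstNLines(int n) {
    return (index, line) -> index < n;
  }

  /** Option 2: the header is every leading line that starts with a given prefix. */
  public static HeaderMatcher startsWith(String prefix) {
    return (index, line) -> line.startsWith(prefix);
  }

  /** Option 3: the header is every leading line accepted by a custom predicate. */
  public static HeaderMatcher custom(Predicate<String> predicate) {
    return (index, line) -> predicate.test(line);
  }

  /**
   * Splits one file's lines into header and body, then parses each body line
   * together with that file's header: (header, line) -> user type.
   */
  public static <T> List<T> parse(
      List<String> lines,
      HeaderMatcher matcher,
      BiFunction<List<String>, String, T> parseFn) {
    List<String> header = new ArrayList<>();
    int i = 0;
    // The header is the longest matching prefix of the file; matching stops
    // at the first line the matcher rejects.
    while (i < lines.size() && matcher.isHeader(i, lines.get(i))) {
      header.add(lines.get(i));
      i++;
    }
    List<String> readOnlyHeader = Collections.unmodifiableList(header);
    List<T> records = new ArrayList<>();
    for (; i < lines.size(); i++) {
      records.add(parseFn.apply(readOnlyHeader, lines.get(i)));
    }
    return records;
  }

  public static void main(String[] args) {
    List<String> csv = Arrays.asList("id,name", "1,a", "2,b");
    // Pair the second column name from the header with each row's second value.
    List<String> rows = parse(
        csv,
        firstNLines(1),
        (header, line) -> header.get(0).split(",")[1] + "=" + line.split(",")[1]);
    System.out.println(rows); // prints [name=a, name=b]
  }
}
```

Because the header is passed to the parse function alongside every line, each file can carry a different header, which is the reason a single side input would not work here.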

  was:
Users frequently request the ability to skip some header rows when reading text files.


https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow

This is also relevant for reading file formats such as VCF, see thread https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E

Python partially supports this via skip_header_lines (https://github.com/apache/beam/pull/1771/files), but the header lines can contain useful content, and the number of header lines is not fixed (e.g. in VCF).

We should figure out a good API for this and support this natively in TextIO. The API decisions would be:

- How do we specify how much of the beginning of each file is the header? Options include a fixed number of lines, or lines that start with a certain character.
- How do we make the header contents accessible to a user of TextIO? Since the header can differ between files, we can't return it as a PCollectionView<List<String>>. Instead, I suppose that when you use a header, you'd need to specify a SerializableFunction<KV<List<String>, String>, T> or something like that for parsing (header, line) -> user type. Note that TextIO.Read currently does not support returning a user type anyway, so that would need to be done too.


> TextIO should support reading header lines
> ------------------------------------------
>
>                 Key: BEAM-2776
>                 URL: https://issues.apache.org/jira/browse/BEAM-2776
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>
> Users frequently request the ability to skip some header rows when reading text files.
> https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
> https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
> https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
> https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
> https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow
> This is also relevant for reading file formats such as VCF, see thread https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E
> Python partially supports this via skip_header_lines (https://github.com/apache/beam/pull/1771/files), but the header lines can contain useful content, and the number of header lines is not fixed (e.g. in VCF).
> We should figure out a good API for this and support this natively in TextIO. The API decisions would be:
> - How do we specify how much of the beginning of each file is the header? Options include a fixed number of lines, lines that start with a certain character, or a custom predicate.
> - How do we make the header contents accessible to a user of TextIO? Since the header can differ between files, we can't return it as a PCollectionView<List<String>>. Instead, I suppose that when you use a header, you'd need to specify a SerializableFunction<KV<List<String>, String>, T> or something like that for parsing (header, line) -> user type. Note that TextIO.Read currently does not support returning a user type anyway, so that would need to be done too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)