
Posted to commits@beam.apache.org by "Luke Cwik (JIRA)" <ji...@apache.org> on 2016/04/04 22:21:25 UTC

[jira] [Assigned] (BEAM-167) TextIO can't read concatenated gzip files

     [ https://issues.apache.org/jira/browse/BEAM-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke Cwik reassigned BEAM-167:
------------------------------

    Assignee: Luke Cwik

> TextIO can't read concatenated gzip files
> -----------------------------------------
>
>                 Key: BEAM-167
>                 URL: https://issues.apache.org/jira/browse/BEAM-167
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>            Reporter: Eugene Kirpichov
>            Assignee: Luke Cwik
>
> $ cat <<END > header.csv
> a,b,c
> END
> $ cat <<END > body.csv
> 1,2,3
> 4,5,6
> 7,8,9
> END
> $ gzip -c header.csv > file.gz
> $ gzip -c body.csv >> file.gz
> The file is well-formed:
> $ gzip -dc file.gz
> a,b,c
> 1,2,3
> 4,5,6
> 7,8,9
> However, TextIO.Read.from("/path/to/file.gz") reads only "a,b,c". This is reproducible even when the file is on local disk and the pipeline runs with the DirectPipelineRunner.
> The bug is in CompressedSource. It uses GzipCompressorInputStream, which by default reads only the first gzip stream in the file, but has an option to read all of them. Previously (in Dataflow SDK 1.4.0) we used GZIPInputStream, which reads all streams.
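
For illustration only, here is a minimal sketch of the behavior the report expects, using nothing but the JDK. It builds the same two-member gzip file in memory and reads it back with java.util.zip.GZIPInputStream, which continues into the second member. The class and method names (ConcatenatedGzipDemo, concatenatedGzip, readAll) are hypothetical and are not part of Beam or CompressedSource. In Commons Compress, the option mentioned above is, as far as I know, the two-argument constructor GzipCompressorInputStream(InputStream, boolean decompressConcatenated); it is not exercised here, to keep the example free of dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Beam.
public class ConcatenatedGzipDemo {

    // Compress each part as its own gzip member and append the members back to back,
    // mirroring `gzip -c header.csv > file.gz; gzip -c body.csv >> file.gz`.
    static byte[] concatenatedGzip(String... parts) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        try {
            for (String part : parts) {
                GZIPOutputStream gz = new GZIPOutputStream(file);
                gz.write(part.getBytes(StandardCharsets.UTF_8));
                // Closing writes this member's trailer; closing the underlying
                // ByteArrayOutputStream is a no-op, so later members can still be appended.
                gz.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file.toByteArray();
    }

    // Decompress the whole byte array; GZIPInputStream moves on to the next
    // member when more input follows the end of the current one.
    static String readAll(byte[] gzipBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipBytes))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] file = concatenatedGzip("a,b,c\n", "1,2,3\n4,5,6\n7,8,9\n");
        // Prints the header line followed by the three body lines.
        System.out.print(readAll(file));
    }
}
```

A reader that stops after the first member would return only "a,b,c" for the same bytes, which is the symptom described in this issue.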



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)