Posted to commits@beam.apache.org by "Tibor Kiss (JIRA)" <ji...@apache.org> on 2017/03/24 12:32:41 UTC

[jira] [Commented] (BEAM-778) Make fileio._CompressedFile seekable.

    [ https://issues.apache.org/jira/browse/BEAM-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940239#comment-15940239 ] 

Tibor Kiss commented on BEAM-778:
---------------------------------

The current implementation maintains a local {{_read_buffer}} object that is used throughout the read path.
I suspect {{_read_buffer}} exists to bridge the gap between what the zlib module offers (block-level compress
and decompress only) and the operations a file object must support (reading bytes and reading lines).
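
To illustrate the gap, here is a minimal sketch of why a buffer is needed when only zlib's block-level API is available. The class name {{BufferedZlibReader}} and its {{chunk_size}} parameter are made up for this example; this is not the actual {{_CompressedFile}} code, only the general pattern as I understand it.

```python
import gzip
import io
import zlib


class BufferedZlibReader:
    """Hypothetical illustration only; not Beam's _CompressedFile."""

    def __init__(self, raw, chunk_size=16 * 1024):
        self._raw = raw
        self._chunk_size = chunk_size
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        self._decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._read_buffer = b""
        self._eof = False

    def _fill(self):
        """Decompress one more chunk into the buffer; return False at EOF."""
        if self._eof:
            return False
        chunk = self._raw.read(self._chunk_size)
        if chunk:
            self._read_buffer += self._decompressor.decompress(chunk)
        else:
            self._read_buffer += self._decompressor.flush()
            self._eof = True
        return not self._eof

    def readline(self):
        # zlib has no notion of lines, so keep decompressing blocks
        # until the buffer holds a newline or the input is exhausted.
        while b"\n" not in self._read_buffer and self._fill():
            pass
        idx = self._read_buffer.find(b"\n")
        end = len(self._read_buffer) if idx < 0 else idx + 1
        line = self._read_buffer[:end]
        self._read_buffer = self._read_buffer[end:]
        return line


# Small demo: a tiny chunk size forces several decompress() calls per line.
compressed = gzip.compress(b"a\nbb\nccc")
reader = BufferedZlibReader(io.BytesIO(compressed), chunk_size=4)
lines = [reader.readline() for _ in range(4)]
```

Note that nothing in this pattern gives seek() or tell(); those would have to be built on top of the buffer by hand.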

My impression is that if we replaced the zlib module with gzip (which builds on top of zlib), we could simply
delegate the read operations to gzip's corresponding methods, with no need for a local buffer.
The bz2 module also supports these read operations.
As a bonus, seek() functionality would come for 'free', since both gzip and bz2 file objects support seek() and tell().
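
A rough sketch of what that delegation could look like, using only the standard library. The payload and variable names are invented for the example, and it assumes Python 3, where {{bz2.BZ2File}} accepts a file object (I believe the Python 2.7 version only accepts a filename, so that part would need checking).

```python
import bz2
import gzip
import io

payload = b"first line\nsecond line\nthird line\n"

# gzip: wrap an already-open raw file object.
raw_gzip = io.BytesIO(gzip.compress(payload))
with gzip.GzipFile(fileobj=raw_gzip, mode="rb") as f:
    gz_first = f.readline()
    gz_pos = f.tell()        # offset in the *uncompressed* stream
    gz_second = f.readline()
    f.seek(gz_pos)           # jump back to the start of the second line
    gz_again = f.readline()

# bz2: same file-like interface.
raw_bz2 = io.BytesIO(bz2.compress(payload))
with bz2.BZ2File(raw_bz2, mode="rb") as f:
    bz_first = f.readline()
    bz_pos = f.tell()
    f.seek(0)                # rewind to the beginning
    bz_all = f.read()
```

One caveat: 'free' applies to the API, not necessarily to the cost. As far as I know, both modules emulate seeking on the uncompressed stream, so a backward seek may rewind and decompress again from the start of the file.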

[~robertwb] / [~sbilac] / [~katsiapis@google.com] / [~altay]: 
Did you consider using the gzip module?
What are your thoughts on dropping {{_read_buffer}} and delegating the file operations to bz2/gzip instead?

> Make fileio._CompressedFile seekable.
> -------------------------------------
>
>                 Key: BEAM-778
>                 URL: https://issues.apache.org/jira/browse/BEAM-778
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py
>            Reporter: Chamikara Jayalath
>            Assignee: Tibor Kiss
>             Fix For: Not applicable
>
>
> We have a TODO to make fileio._CompressedFile seekable.
> https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L692
> Without this, compressed file objects produced for FileBasedSource implementations may not be able to use libraries that rely on the seek() and tell() methods.
> For example, tarfile.open().



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)