Posted to dev@beam.apache.org by Ashwin Ramaswami <ar...@gmail.com> on 2020/05/27 16:09:57 UTC

Proposal for reading from / writing to archive files

I have a requirement where I need to read from / write to archive files (such as .tar, .zip). Essentially, I'd like to treat the entire .zip file I read from as a filesystem, so that I can only get the files I need that are within the archive. This is useful, because some archive formats such as .zip allow random access (so one does not need to read the entire zip file in order to just read a single file from it).

I've made an issue outlining how this might be designed -- would appreciate any feedback or thoughts about how this might work! https://issues.apache.org/jira/browse/BEAM-10111
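The random-access property described above can be sketched in plain Python (no Beam involved): `zipfile` reads the central directory at the end of the archive, so listing entries and extracting a single member does not require scanning the whole file.

```python
import io
import zipfile

# Build a small in-memory archive with several members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "alpha")
    zf.writestr("b.txt", "beta")
    zf.writestr("c.txt", "gamma")

# Random access: list the entries, then read exactly one member.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()    # ['a.txt', 'b.txt', 'c.txt']
    data = zf.read("b.txt")  # seeks directly to b.txt's entry -> b'beta'
```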

Re: Proposal for reading from / writing to archive files

Posted by Robert Bradshaw <ro...@google.com>.
On Thu, May 28, 2020 at 9:34 AM Chamikara Jayalath <ch...@google.com>
wrote:

> Thanks for the contribution. This sounds very interesting. A few comments.
>
> * | fileio.MatchFiles('hdfs://path/to/*.zip') | fileio.ExtractMatches() |
> fileio.MatchAll()
>
> We usually do either 'fileio.MatchFiles('hdfs://path/to/*.zip')' or
> 'fileio.MatchAll()': the former reads a specific glob, and the latter reads
> a PCollection of globs. We also have support for reading compressed files.
> We should add to that API instead of using both.
>
> * ArchiveSystem with list() and extract().
>
> Is this something we can add to the existing FileSystems abstraction
> instead of introducing a new abstraction?
>

+1

In particular, something like
zip://hdfs://path/to/zip:glob/within/zip/*.txt could be a new zipfile
filesystem that can support parallel reads and delegate to any other
filesystem. One could then write

  p | fileio.MatchFiles('hdfs://path/to/*.zip')  # produces a PCollection of zip file paths
    | fileio.ExtractMatches()  # produces a PCollection of zip file entries, using a zipfile filesystem
    | fileio.ReadMatches()  # actually reads the files; one could do a text read, or whatever, here as well
    | ...

Note that tar files do not support random access (or even listing without
reading the entire contents), so are poorly suited for this.
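One piece such a zipfile filesystem would need is splitting the composite path into the delegate filesystem path and the glob inside the archive. A hypothetical sketch, assuming the "zip://&lt;archive-path&gt;:&lt;inner-glob&gt;" layout suggested above (the helper name is invented, not an existing Beam API):

```python
def split_zip_path(path):
    """Split e.g. 'zip://hdfs://path/to/a.zip:glob/within/zip/*.txt'
    into ('hdfs://path/to/a.zip', 'glob/within/zip/*.txt')."""
    prefix = "zip://"
    if not path.startswith(prefix):
        raise ValueError("not a zip filesystem path: %r" % path)
    # The last ':' separates the delegate filesystem path from the glob
    # inside the archive (assumes the inner glob itself contains no ':').
    base, sep, inner = path[len(prefix):].rpartition(":")
    if not sep or not base:
        raise ValueError("missing inner glob in: %r" % path)
    return base, inner

base, inner = split_zip_path("zip://hdfs://path/to/a.zip:glob/within/zip/*.txt")
```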


> *
> fileio.CompressMatches
> fileio.WriteToArchive
>
> Is this scalable for a distributed system? Usually we write a file per
> bundle.
>
> I suggest writing a doc with some background research related to how other
> data processing systems achieve this functionality so that we can try to
> determine if the functionality can be added to the existing API somehow.
>

Yeah, zip files are not writable in parallel. One /could/ do the
compression in parallel, and then have a final "writer" that just does
concat (with the appropriate headers) to the final zipfile(s).
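The compress-in-parallel-then-concatenate idea can be sketched in plain Python. zip itself needs a central directory written at the end by the final writer, so raw byte concatenation is not sufficient there; gzip, whose format allows independently compressed members to be concatenated into one valid stream, illustrates the principle:

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

bundles = [b"bundle-1 records\n", b"bundle-2 records\n", b"bundle-3 records\n"]

# "Parallel" stage: each bundle is compressed independently (one gzip
# member each), standing in for per-bundle work on distributed workers.
with ThreadPoolExecutor() as pool:
    members = list(pool.map(gzip.compress, bundles))

# Final "writer" stage: byte-concatenate the precompressed members as-is.
archive = b"".join(members)

# Decompressing the multi-member stream yields the concatenated payloads.
assert gzip.decompress(archive) == b"".join(bundles)
```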


Re: Proposal for reading from / writing to archive files

Posted by Chamikara Jayalath <ch...@google.com>.
Thanks for the contribution. This sounds very interesting. A few comments.

* | fileio.MatchFiles('hdfs://path/to/*.zip') | fileio.ExtractMatches() |
fileio.MatchAll()

We usually do either 'fileio.MatchFiles('hdfs://path/to/*.zip')' or
'fileio.MatchAll()': the former reads a specific glob, and the latter reads
a PCollection of globs. We also have support for reading compressed files.
We should add to that API instead of using both.

* ArchiveSystem with list() and extract().

Is this something we can add to the existing FileSystems abstraction
instead of introducing a new abstraction?

*
fileio.CompressMatches
fileio.WriteToArchive

Is this scalable for a distributed system? Usually we write a file per
bundle.

I suggest writing a doc with some background research related to how other
data processing systems achieve this functionality so that we can try to
determine if the functionality can be added to the existing API somehow.

Thanks,
Cham