Posted to dev@beam.apache.org by Christopher Larsen <ch...@google.com> on 2020/04/22 17:11:06 UTC

[QUESTION] Reading Snappy Compressed Text Files

Hi devs,

We are trying to build a pipeline with the Java SDK that reads snappy
compressed text files containing one record per line.

We have tried the following to read the files:

p.apply("ReadLines",
FileIO.match().filepattern((options.getInputFilePattern())))
        .apply(FileIO.readMatches())
        .setCoder(SnappyCoder.of(ReadableFileCoder.of()))
        .apply(TextIO.readFiles())
        .apply(ParDo.of(new TransformRecord()));

Is there a recommended way to decompress and read Snappy files with Beam?

Thanks,
Chris

Re: [QUESTION] Reading Snappy Compressed Text Files

Posted by Robert Bradshaw <ro...@google.com>.
On Wed, Apr 22, 2020 at 11:06 AM Jeff Klukas <jk...@mozilla.com> wrote:

> Beam is able to infer compression from file extensions for a variety of
> formats, but snappy is not among them currently:
>
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Compression.java
>
> That said, ParquetIO and AvroIO each appear to support snappy.
>
> As best I can tell, there is no current built-in support for reading text
> files compressed via snappy. I think you would need to use FileIO to match
> the files, then implement a custom DoFn that takes each file object, streams
> its contents through a snappy decompressor, and outputs one record per
> line.
>
> I imagine a PR to add snappy as a supported format in Compression.java
> would be welcome.
>

+1, and probably not that difficult either.
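
Roughly, the new entry would just mirror the existing values in
Compression.java. An untested sketch, with the constructor arguments and
method signatures assumed from the existing GZIP entry, and snappy-java's
framed streams used for the actual compression:

SNAPPY(".snappy", ".snappy") {
  @Override
  public ReadableByteChannel readDecompressed(ReadableByteChannel channel) throws IOException {
    // Wrap the raw channel in a snappy decompressor (org.xerial.snappy).
    return Channels.newChannel(
        new SnappyFramedInputStream(Channels.newInputStream(channel)));
  }

  @Override
  public WritableByteChannel writeCompressed(WritableByteChannel channel) throws IOException {
    // Mirror image for writes.
    return Channels.newChannel(
        new SnappyFramedOutputStream(Channels.newOutputStream(channel)));
  }
},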


>
> On Wed, Apr 22, 2020 at 1:16 PM Christopher Larsen <ch...@google.com>
> wrote:
>
>> Hi devs,
>>
>> We are trying to build a pipeline with the Java SDK that reads snappy
>> compressed text files containing one record per line.
>>
>> We have tried the following to read the files:
>>
>> p.apply("ReadLines",
>> FileIO.match().filepattern((options.getInputFilePattern())))
>>         .apply(FileIO.readMatches())
>>         .setCoder(SnappyCoder.of(ReadableFileCoder.of()))
>>         .apply(TextIO.readFiles())
>>         .apply(ParDo.of(new TransformRecord()));
>>
>> Is there a recommended way to decompress and read Snappy files with Beam?
>>
>> Thanks,
>> Chris
>>
>

Re: [QUESTION] Reading Snappy Compressed Text Files

Posted by Jeff Klukas <jk...@mozilla.com>.
Beam is able to infer compression from file extensions for a variety of
formats, but snappy is not among them currently:

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Compression.java

That said, ParquetIO and AvroIO each appear to support snappy.

As best I can tell, there is no current built-in support for reading text
files compressed via snappy. I think you would need to use FileIO to match
the files, then implement a custom DoFn that takes each file object, streams
its contents through a snappy decompressor, and outputs one record per
line.
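
A rough, untested sketch of that DoFn, assuming the files were written in
snappy-java's framed format (ReadSnappyLinesFn is just a placeholder name):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.xerial.snappy.SnappyFramedInputStream;

class ReadSnappyLinesFn extends DoFn<FileIO.ReadableFile, String> {
  @ProcessElement
  public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<String> out)
      throws IOException {
    // file.open() gives a ReadableByteChannel over the matched file's bytes;
    // wrap it in a snappy decompressor and emit one element per line.
    try (BufferedReader reader =
        new BufferedReader(
            new InputStreamReader(
                new SnappyFramedInputStream(Channels.newInputStream(file.open())),
                StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        out.output(line);
      }
    }
  }
}

It would slot into your pipeline in place of the setCoder(...) and
TextIO.readFiles() steps:

p.apply("ReadLines",
        FileIO.match().filepattern(options.getInputFilePattern()))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new ReadSnappyLinesFn()))
    .apply(ParDo.of(new TransformRecord()));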

I imagine a PR to add snappy as a supported format in Compression.java
would be welcome.

On Wed, Apr 22, 2020 at 1:16 PM Christopher Larsen <ch...@google.com>
wrote:

> Hi devs,
>
> We are trying to build a pipeline with the Java SDK that reads snappy
> compressed text files containing one record per line.
>
> We have tried the following to read the files:
>
> p.apply("ReadLines",
> FileIO.match().filepattern((options.getInputFilePattern())))
>         .apply(FileIO.readMatches())
>         .setCoder(SnappyCoder.of(ReadableFileCoder.of()))
>         .apply(TextIO.readFiles())
>         .apply(ParDo.of(new TransformRecord()));
>
> Is there a recommended way to decompress and read Snappy files with Beam?
>
> Thanks,
> Chris
>