Posted to dev@beam.apache.org by Christopher Larsen <ch...@google.com> on 2020/04/22 17:11:06 UTC
[QUESTION] Reading Snappy Compressed Text Files
Hi devs,
We are trying to build a pipeline to read snappy compressed text files that
contain one record per line using the Java SDK.
We have tried the following to read the files:
    p.apply("ReadLines",
            FileIO.match().filepattern((options.getInputFilePattern())))
        .apply(FileIO.readMatches())
        .setCoder(SnappyCoder.of(ReadableFileCoder.of()))
        .apply(TextIO.readFiles())
        .apply(ParDo.of(new TransformRecord()));
Is there a recommended way to decompress and read Snappy files with Beam?
Thanks,
Chris
Re: [QUESTION] Reading Snappy Compressed Text Files
Posted by Robert Bradshaw <ro...@google.com>.
On Wed, Apr 22, 2020 at 11:06 AM Jeff Klukas <jk...@mozilla.com> wrote:
> Beam is able to infer compression from file extensions for a variety of
> formats, but snappy is not among them currently:
>
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Compression.java
>
> Although ParquetIO and AvroIO each look to have support for snappy.
>
> So as best I can tell, there is no current built-in support for reading
> text files compressed via snappy. I think you would need to use FileIO to
> match files, and then implement a custom DoFn that takes the file object,
> streams the contents through a snappy decompressor, and outputs one record
> per line.
>
> I imagine a PR to add snappy as a supported format in Compression.java
> would be welcome.
>
+1, and probably not that difficult either.
Re: [QUESTION] Reading Snappy Compressed Text Files
Posted by Jeff Klukas <jk...@mozilla.com>.
Beam is able to infer compression from file extensions for a variety of
formats, but snappy is not among them currently:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Compression.java
ParquetIO and AvroIO each appear to support snappy, though.
So as best I can tell, there is no current built-in support for reading
text files compressed via snappy. I think you would need to use FileIO to
match files, and then implement a custom DoFn that takes the file object,
streams the contents through a snappy decompressor, and outputs one record
per line.
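A rough sketch of the core of such a DoFn, using only the JDK so it runs as-is. The Beam wiring is left out on purpose: in a DoFn over FileIO.ReadableFile, the raw stream would presumably come from Channels.newInputStream(file.open()) and each line would go to the output receiver. The class, interface, and method names here are made up for illustration, and GZIP only stands in for the decompressor. With snappy-java on the classpath, something like SnappyInputStream::new might take its place, assuming the files were written in the stream format that class reads (raw, framed, and Hadoop-style snappy files differ).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: not part of Beam or of any snappy library.
public class DecompressLines {

  /** Wraps a raw byte stream in a decompressing stream. */
  @FunctionalInterface
  public interface Decompressor {
    InputStream wrap(InputStream raw) throws IOException;
  }

  /**
   * Streams raw through the decompressor and hands each line to out.
   * In a DoFn, out would be the output receiver and raw the opened file.
   */
  public static void emitLines(
      InputStream raw, Decompressor decompressor, Consumer<String> out) {
    // Listing raw as its own resource closes it even if wrap() fails.
    try (InputStream closeRaw = raw;
        BufferedReader reader =
            new BufferedReader(
                new InputStreamReader(
                    decompressor.wrap(closeRaw), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        out.accept(line); // one record per line
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a small compressed input in memory (GZIP as a stand-in for snappy).
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
      gz.write("record-1\nrecord-2\n".getBytes(StandardCharsets.UTF_8));
    }

    List<String> lines = new ArrayList<>();
    emitLines(
        new ByteArrayInputStream(buf.toByteArray()),
        GZIPInputStream::new,
        lines::add);
    System.out.println(lines); // prints [record-1, record-2]
  }
}
```

Because the file is read as a single stream, each file would be handled by one worker. That is likely unavoidable for a non-splittable compressed text file.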
I imagine a PR to add snappy as a supported format in Compression.java
would be welcome.