Posted to user@beam.apache.org by Randal Moore <rd...@gmail.com> on 2018/10/12 21:39:57 UTC

Strange 'gzip error' running Beam on Dataflow

Using Beam Java SDK 2.6.

I have a batch pipeline that has run successfully in its current form several
times. Suddenly I am getting strange errors complaining about the format of
the input. As far as I know, the pipeline didn't change at all since the
last successful run. The error:
java.util.zip.ZipException: Not in GZIP format - Trace:
org.apache.beam.sdk.util.UserCodeException
indicates that something, somewhere, thinks the lines of text are supposed
to be gzipped. I don't know what is setting that expectation, nor which
code thinks the input should be gzipped.

The pipeline uses TextIO to read from a Google Cloud Storage bucket. The
content of the bucket object is individual "text" lines (actually, each line
is JSON-encoded). The error occurs in the first DoFn following the TextIO
read - the one that converts each string to a value object.

My log message in the exception handler shows the exact text of the string
I am expecting. I tried logging the call stack to see where the GZIP
exception is thrown - it turns out to be a bit hard to follow (a bunch of
Dataflow classes are called at the line in the processElement method that
first uses the string).


   - Changing the lines to pure text, like "hello" and "world", gets to the
   JSON parser, which throws an error (since it isn't JSON any more).
   - If I base64-encode the lines, I [still] get the GZIP exception.
   - I was running an older version of Beam, so I upgraded to 2.6. That
   didn't help.
   - The bucket object uses the *application/octet-stream* content type.
   - I tried changing the read from the bucket from the default to
   explicitly uncompressed (see the Java sketch below):
   TextIO.read.from(job.inputsPath).withCompression(Compression.UNCOMPRESSED)

One other detail is that most of the code is written in Scala, even though
it uses the Java SDK for Beam.
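
For context, the failing stage looks roughly like this, shown as plain Java
rather than our Scala (pipeline, job.inputsPath, MyValue, parseJson, and LOG
are simplified stand-ins for the real code):

    import org.apache.beam.sdk.io.Compression;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    // Read the newline-delimited JSON objects from GCS, with decompression
    // explicitly disabled.
    PCollection<String> lines = pipeline.apply("ReadLines",
        TextIO.read()
            .from(job.inputsPath)
            .withCompression(Compression.UNCOMPRESSED));

    // First DoFn after the read - this is where the ZipException surfaces.
    lines.apply("ParseJson", ParDo.of(new DoFn<String, MyValue>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        try {
          c.output(parseJson(c.element()));
        } catch (Exception e) {
          LOG.error("Failed on line: " + c.element(), e); // shows the exact text
        }
      }
    }));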

Any help appreciated!
rdm

Re: Strange 'gzip error' running Beam on Dataflow

Posted by Randal Moore <rd...@gmail.com>.
The files have no content-encoding set. They are not BigQuery exports but
rather are crafted by a service of mine.

Note that my DoFn gets called for each line of the file, something I don't
think would happen if the object itself were gzipped - wouldn't gunzip apply
to the whole content rather than to individual lines?
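
For reference, here is one way to double-check the object metadata with the
Cloud Storage Java client (bucket and object names below are placeholders):

    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob blob = storage.get(BlobId.of("my-bucket", "path/to/inputs.jsonl"));
    // Neither field mentions gzip for these objects.
    System.out.println("contentType     = " + blob.getContentType());
    System.out.println("contentEncoding = " + blob.getContentEncoding());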

On Fri, Oct 12, 2018, 5:04 PM Jose Ignacio Honrado <ji...@gmail.com>
wrote:

Re: Strange 'gzip error' running Beam on Dataflow

Posted by Jose Ignacio Honrado <ji...@gmail.com>.
Hi Randal,

You might be experiencing GCS's automatic decompressive transcoding. Take a
look at this to see if it helps:
https://cloud.google.com/storage/docs/transcoding

It seems like a compressed file is expected (judging by the gz extension),
but the file is returned decompressed by GCS.

Any chance these files in GCS are exported from BigQuery? I started to
"suffer" from a similar issue because exports from BQ tables to GCS started
setting new metadata (content-encoding: gzip, content-type: text/csv) on
the output files and, as a consequence, the GZIP files were automatically
decompressed when downloading them (as explained in the link above).
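
If that turns out to be the cause, one possible fix is to clear the
Content-Encoding metadata so GCS stops transcoding and serves the raw bytes.
A rough sketch with the Cloud Storage Java client (bucket and object names
are placeholders):

    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob blob = storage.get(BlobId.of("my-bucket", "export-000.csv.gz"));
    // With Content-Encoding cleared, GCS serves the object's actual
    // (compressed) bytes instead of decompressing on download.
    blob.toBuilder().setContentEncoding(null).build().update();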

Best,

On Fri, Oct 12, 2018, 23:40 Randal Moore <rd...@gmail.com> wrote:
