Posted to dev@beam.apache.org by Gleb Kanterov <gl...@spotify.com> on 2019/08/13 20:13:20 UTC

Java serialization for coders and compatibility

I'm looking into the code of AvroCoder, and I was wondering what happens
when users upgrade Beam for streaming pipelines?

As I understand it, we should be able to deserialize coders serialized by a
previous Beam version. Looking at the guava vendoring, switching the guava
version is going to break serialization, because the current version is part
of the package namespace:

import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Supplier;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Suppliers;

We don't have tests for this, but we probably already broke compatibility
when we vendored guava. Can anybody clarify what the approach for coders
would be?
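The concern above can be demonstrated without any Beam dependency. A minimal
sketch (FakeCoder is a hypothetical stand-in for a coder class living under a
vendored package): Java serialization writes the fully-qualified class name
into the stream, so renaming the package (e.g. v26_0_jre to a newer vendored
version) makes previously serialized bytes undeserializable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class SerializedNameDemo {
    // Hypothetical stand-in for a coder whose class sits under a
    // vendored package path.
    static class FakeCoder implements Serializable {
        private static final long serialVersionUID = 1L;
        final String schema = "record";
    }

    static byte[] serialize(Serializable s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(s);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize(new FakeCoder());
        // ISO-8859-1 maps bytes 1:1 to chars, so we can scan the raw stream.
        String dump = new String(bytes, StandardCharsets.ISO_8859_1);
        // The class name (including its package) is embedded in the stream,
        // which is why a vendored-package rename breaks deserialization.
        System.out.println(dump.contains("FakeCoder")); // prints "true"
    }
}
```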

Re: Java serialization for coders and compatibility

Posted by Lukasz Cwik <lc...@google.com>.
Coders such as AvroCoder are translated to an intermediate JSON form called
a CloudObject[1].
Dataflow only uses the serialized Java representation (embedded as base64
bytes within the CloudObject) for coders which extend SerializableCoder[2].
Dataflow only cares that these CloudObject representations don't change.
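To make that concrete, here is a rough sketch (not the real Dataflow
CloudObject API; the "kind:serialized_coder" type string and field names are
assumptions) of how a SerializableCoder-style coder could be captured: the
Java-serialized bytes are base64-encoded and embedded in a JSON-like map. Any
change that breaks Java deserialization of those bytes breaks the pipeline.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

public class CloudObjectSketch {
    // Hypothetical coder standing in for a SerializableCoder subclass.
    static class MyCoder implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    static Map<String, Object> toCloudObjectLike(Serializable coder) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(coder);
        }
        Map<String, Object> cloudObject = new LinkedHashMap<>();
        cloudObject.put("@type", "kind:serialized_coder"); // assumed kind string
        cloudObject.put("serialized_coder",
                Base64.getEncoder().encodeToString(bos.toByteArray()));
        return cloudObject;
    }

    static Object fromCloudObjectLike(Map<String, Object> cloudObject)
            throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode((String) cloudObject.get("serialized_coder"));
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            // Fails with ClassNotFoundException if the embedded class
            // name (e.g. a vendored package path) no longer exists.
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> co = toCloudObjectLike(new MyCoder());
        System.out.println(fromCloudObjectLike(co) instanceof MyCoder); // prints "true"
    }
}
```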

Beam runners in general could today rely on the proto version of the coder
(which is eventually meant to replace the CloudObject that Dataflow uses).
Relying on never breaking Java serialization makes changing coders too
strict.
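The advantage of the proto route is that a coder is identified by a stable
URN plus an opaque payload, so the wire representation no longer depends on
Java class names or vendored package paths. A plain-Java sketch of the idea
(not Beam's actual proto classes; "beam:coder:bytes:v1" is a real Beam URN,
while "beam:coder:avro:v1" below is only illustrative):

```java
import java.nio.charset.StandardCharsets;

// Stand-in for the portable representation: a (URN, payload) pair.
// Runners dispatch on the URN string rather than on a Java class name,
// so renaming/vendoring Java packages does not affect the wire format.
public class CoderSpec {
    final String urn;     // e.g. "beam:coder:bytes:v1"
    final byte[] payload; // coder-specific configuration, e.g. a schema

    CoderSpec(String urn, byte[] payload) {
        this.urn = urn;
        this.payload = payload;
    }

    public static void main(String[] args) {
        byte[] schema = "{\"type\":\"record\"}".getBytes(StandardCharsets.UTF_8);
        CoderSpec spec = new CoderSpec("beam:coder:avro:v1", schema); // illustrative URN
        // Any Beam version can match on the URN instead of a class name:
        System.out.println(spec.urn.startsWith("beam:coder:")); // prints "true"
    }
}
```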

Eventually we could be using schemas for most things and then the
representation will be completely definable and separate from the
implementation.

There might be more details in the snapshotting and update design doc[3],
but I believe it was pretty light on how to deal with these kinds of
changes, beyond acknowledging that we know it's a problem.

1:
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/AvroCoderCloudObjectTranslator.java
2:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/SerializableCoder.java
3:
https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY

