You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Lukasz Cwik <lc...@google.com> on 2017/12/21 18:28:29 UTC

Re: SerializableCoder Structured Value

https://issues.apache.org/jira/browse/BEAM-3279 has been filed for
consistentWithEquals
https://issues.apache.org/jira/browse/BEAM-3385 has been filed for
SerializableCoder

On Tue, Nov 28, 2017 at 1:20 PM, Eugene Kirpichov <ki...@google.com>
wrote:

> Kenn - I agree that consistentWithEquals() is redundant w.r.t.
> structuralValue(), and should be deprecated. I think our mutation detectors
> are already using structuralValue(), so the work here would be to simply
> mark the method deprecated, remove all remaining overrides in the SDK, and
> document that overriding the method is a no-op.
>
> Also agree with Luke that we should document that SerializableCoder should
> be used only for objects that have a proper equals(). Because we need
> *some* structural value for mutation detection, and since this coder is
> non-deterministic, it can't provide any structural value other than the
> object itself. In this case, I suppose, the work would involve just
> documentation.
>
> Not sure if we need anything extra in the direct runner for verification:
> won't existing mutation detectors already fire false positives in case
> these properties are violated?
>
> On Tue, Nov 28, 2017 at 11:03 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> I think that at least we should be clear in the documentation for
>> SerializableCoder and also make sure that the DirectRunner validates the
>> consistentWithEquals property.
>>
>> Optionally one of:
>> 1) Make a version of SerializableCoder that can be constructed where it
>> says it is consistentWithEquals and have users register each type with the
>> CoderRegistry.
>> 2) Document that users subclass SerializableCoder for all types which are
>> consistentWtihEquals and also register them with the CoderRegistry.
>>
>>
>> On Mon, Nov 27, 2017 at 5:39 PM, Kenneth Knowles <kl...@google.com> wrote:
>>
>>> What I said is not quite right - there are accidental collisions
>>> allowed. The "all coders" spec for structural value only requires that
>>> encode(a) == encode(b) implies sv(a).equals(sv(b)). The converse is not
>>> required. For example, the nondeterministic SetCoder can use the Set
>>> objects themselves as structural values, but their encoding may differ. So
>>> for determinism it is actually a.equals(b) implies encode(a) == encode(b)
>>> which in turn implies sv(a).equals(sv(b)). Either way, for deterministic
>>> coders they all coincide.
>>>
>>> On Mon, Nov 27, 2017 at 5:23 PM, Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> To add some flavor,
>>>>
>>>> *All coders:* structuralValue(a).equals(structuralValue(b)) if and
>>>> only if encode(a) == encode(b)
>>>>
>>>> *"Consistent with equals" aka injective:* encode(a) == encode(b)
>>>> implies a.equals(b)
>>>>
>>>> *Deterministic:* a.equals(b) implies structuralValue(a).equals(structuralValue(b))
>>>> (hence encode(a) == encode(b))
>>>>
>>>> The structural value must always be a legitimate substitute for
>>>> encoding to allow in-memory GBK to be faster than encoding.
>>>>
>>>> IMO we should deprecate and retire "consistent with equals" since
>>>> overriding it to return `true` is no simpler than overriding
>>>> structuralValue itself, and it has no purpose other than governing
>>>> structuralValue. The two obvious choices - encoding or return directly -
>>>> are trivial, and getting fancy is optional. The check Luke suggests would
>>>> then just be a test that structuralValue is correct. The mutation detector
>>>> should perhaps just use the structural value and let the coder itself
>>>> decide whether or not it needs to encode.
>>>>
>>>> Also worth considering the dual perspective that highlights
>>>> portability: To a portable runner, the elements are (with a couple
>>>> exceptions) just bytes, and the coders are a way for the SDK to interpret
>>>> them in order to do its computation. The implied spec that the mutation
>>>> detector relies on is that serialize(deserialize(x)) == x for these bytes,
>>>> so if the re-serialized bytes have changed, it assumes the object was
>>>> mutated. In a sense, if an SDK implements "the identity function" yet
>>>> returns different bytes, that is a broken identity function because the
>>>> bytes *are* the element. It is a bit of a strict interpretation, and maybe
>>>> not so useful when the elements are only really interpreted by a single
>>>> SDK, as in the case of SerializableCoder. But I'm not sure what other spec
>>>> is available.
>>>>
>>>> Kenn
>>>>
>>>>
>>>> On Mon, Nov 27, 2017 at 4:37 PM, Mairbek Khadikov <ma...@google.com>
>>>> wrote:
>>>>
>>>>> I'm open to renaming *consistentWithEquals*.
>>>>>
>>>>> If I understand the code correctly, when consistentWithEquals returns
>>>>> true, org.apache.beam.sdk.util.MutationDetectors expects
>>>>> *a.equals(deserialize(serialize(a))* which I think is reasonable for
>>>>> SerializableCoder (assuming objects implement equals)*. *Right now,
>>>>> *serialize(a).equals(serialize(deserialize(serialize(a)))* is
>>>>> expected and that contradicts *"does not guarantee a deterministic
>>>>> encoding"*.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 4:07 PM, Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> I think the idea is that SerializableCoder should be updated to
>>>>>> expect that all values it encodes do implement equals() since this seems to
>>>>>> be the much more common case then classes that don't implement a useful
>>>>>> equals. It would be possible to add a useful check to DirectRunner that any
>>>>>> value that says its consistent with equals actually obeys its contract.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 4:03 PM, Eugene Kirpichov <
>>>>>> kirpichov@google.com> wrote:
>>>>>>
>>>>>>> Not sure where you see the contradiction? consistentWithEquals says
>>>>>>> "Whenever the encoded bytes of two values are equal, then the original
>>>>>>> values are equal according to {@code Objects.equals()}." - which is clearly
>>>>>>> false for Serializable's in general: it's possible that serialized form of
>>>>>>> "a" and "b" is the same bytes, but !a.equals(b), e.g. if this class does
>>>>>>> not implement equals() or if it uses reference equality.
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 3:55 PM Mairbek Khadikov <ma...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Currently SerializableCoder#consistentWithEquals returns false,
>>>>>>>> which contradicts it's own documentation "{@link SerializableCoder} does
>>>>>>>> not guarantee a deterministic encoding, as Java serialization may produce
>>>>>>>> different binary encodings for two equivalent objects".
>>>>>>>>
>>>>>>>> In practice, it leads to failures in org.apache.beam.sdk.util.
>>>>>>>> MutationDetectors$CodedValueMutationDetector. For example, some
>>>>>>>> libraries cache hash values https://github.com/
>>>>>>>> google/protobuf/pull/3939/files#diff-24c442a23a43e041e4f50d324638fe
>>>>>>>> dbR142
>>>>>>>>
>>>>>>>> I'm thinking about changing SerializableCoder#consistentWithEquals
>>>>>>>> to return true, so that the structural value would be the object itself
>>>>>>>> instead of it the serialized bytes. Thoughts?
>>>>>>>>
>>>>>>>> WIP branch https://github.com/mairbek/beam/tree/ser
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>> Mairbek
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Mairbek Khadikov |  Software Engineer |  mairbek@google.com
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Mairbek Khadikov |  Software Engineer |  mairbek@google.com
>>>>>
>>>>>
>>>>
>>>
>>