You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Colm O hEigeartaigh <co...@apache.org> on 2020/03/19 11:50:33 UTC

Question on SerializableCoder

Hi,

I have a question on SerializableCoder. I'm looking at hardening the Java
Object deserialization that is taking place. We have a "Class<T> type" that
is used to decode the input stream:

ObjectInputStream ois = new ObjectInputStream(inStream);
return type.cast(ois.readObject());

What I would like to do would be something like:

ObjectInputStream ois = new ObjectInputStream(inStream) {
    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc) throws
IOException, ClassNotFoundException {
        if (!desc.getName().equals(type.getName())) {
            throw new InvalidClassException("Unauthorized deserialization
attempt", desc.getName());
        }
        return super.resolveClass(desc);
    }
};
return type.cast(ois.readObject());

This would prevent a possible security hole where an attacker could try to
force the recipient of the input stream to deserialize to a gadget class or
the like for a RCE.

The question is - does the deserialized type have to correspond exactly to
the supplied Class? Or is it supported that it's a base type / abstract
class? If the latter then my idea won't really work. But if the type
corresponds exactly then it should work OK.

Thanks,

Colm.

Re: Question on SerializableCoder

Posted by Colm O hEigeartaigh <co...@apache.org>.

Thanks Luke. I updated the JIRA to suggest changing the docs along the
lines of your post: https://issues.apache.org/jira/browse/BEAM-9570

Colm.

On Thu, Mar 26, 2020 at 3:06 PM Luke Cwik <lc...@google.com> wrote:

> From the private@ thread, I suggested:
> "With the JvmInitializer[1] being supported by Dataflow and the portable
> Java container, users would be able to write code which sets the system
> property jdk.serialFilter or by configuring
> ObjectInputFilter.Config.setSerialFilter(filter)[2]"
>
> This could become a documentation change to SerializableCoder.
>
> 1:
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java
> 2:
> https://docs.oracle.com/javase/10/core/serialization-filtering1.htm#JSCOR-GUID-952E2328-AB66-4412-8B6B-3BCCB3195C25
>
> On Thu, Mar 26, 2020 at 4:51 AM Colm O hEigeartaigh <co...@apache.org>
> wrote:
>
>> Thanks for all the feedback. Two possible ideas that occur to me:
>>
>>  - Create TypeSafeSerializableCoder or somesuch which extends
>> SerializableCoder and enforces the check as in the PR. Update the
>> documentation to suggest using the new coder if you don't need to support
>> collections as the raw type or subclasses.
>>  - Add a new "of" method to SerializableCoder which takes a boolean
>> parameter to control whether we perform this check or not (defaults to
>> true?).
>>
>> Colm.
>>
>> On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> I have seen people ingest data using SerializableCoder.
>>>
>>> On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>
>>>> I won't bring other people's words from private@, but can share mine.
>>>> I don't believe it exposes anything new.
>>>>
>>>> > If it is SerializableCoder - attacker controls the other end of e.g.
>>>> Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have
>>>> an allowlist or try to automatically construct an allowlist] and otherwise
>>>> there is no vulnerability for internal coders.
>>>>
>>>> I have never seen or heard of a user doing dynamic deserialization
>>>> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is
>>>> important to someone then they would need a more secure solution than
>>>> SerializableCoder.
>>>>
>>>> Side note: it would be great to provide an efficient and usable
>>>> solution for the problem of wanting to dynamically dispatch serde in the
>>>> middle of a pipeline. It is actually independent from being able to provide
>>>> coders for a wide variety of types, which we can do a bunch of different
>>>> (mostly better) ways. (has a better solution been built since the last time
>>>> I thought about this?)
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>
>>>>> The link to the previous covnersation (discussion happened in private@
>>>>> and I suppose we can bring some relevant bits here if needed)
>>>>>
>>>>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E
>>>>>
>>>>> I remember Robert had some points there, but I am not sure we
>>>>> found/agreed on a solution that was relevant and did not break current
>>>>> users and their user experience (like the case of blacklists).
>>>>>
>>>>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <lc...@google.com> wrote:
>>>>> >
>>>>> > Being able to have something that can encode any object (or at least
>>>>> a large class of objects) is extremely powerful so requiring
>>>>> SerializableCoder<T> to only encode T.class would hurt our users.
>>>>> >
>>>>> > I believe someone looked at this kind of problem before and we came
>>>>> to an agreement of usng an explicit approve/deny list on the class names
>>>>> which would address the security concern. I don't remember the thread
>>>>> though and couldn't find the thread after a few minutes of searching.
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>> >>
>>>>> >> So you think the spec for SerializableCoder<T> (currently doesn't
>>>>> really have one) should be that it dynamically dispatches what it
>>>>> deserializes? I had imagined we would treat it more as a statically
>>>>> determined coder, so because it is invariant in T we would not allow up or
>>>>> down casts (they are unsafe). But we probably don't actually have the
>>>>> static information to do that anyhow so you are probably right.
>>>>> >>
>>>>> >> I wonder about the threat model here. Is this the event that the
>>>>> runner (managed service or bespoke cluster) is compromised and is
>>>>> attempting RCE on the Java SDK harness or runner-specific Java-based worker?
>>>>> >>
>>>>> >> Kenn
>>>>> >>
>>>>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com> wrote:
>>>>> >>>
>>>>> >>> I don't think this is going to work since SerializableCoder<T>
>>>>> should be able to decode T and all objects that implement/extend T. I'm
>>>>> pretty sure SerializableCoder<Set/List/...> is common enough while the
>>>>> concrete type is HashSet/ArrayList/...
>>>>> >>> I'm pretty sure there is some way you could come up with some way
>>>>> for making this optin though.
>>>>> >>>
>>>>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <
>>>>> coheigea@apache.org> wrote:
>>>>> >>>>
>>>>> >>>> Thanks Kenn. I submitted a PR here:
>>>>> https://github.com/apache/beam/pull/11191
>>>>> >>>>
>>>>> >>>> Colm.
>>>>> >>>>
>>>>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>> >>>>>
>>>>> >>>>> I think this is fine. The same coder is used for encode and
>>>>> decode, so the Class object should be the same as well. Inheritance is not
>>>>> part of the Beam model (thank goodness) so this is a language-specific
>>>>> concern. As far as the model is concerned, the full URN and the payload of
>>>>> the coder is its identity and coders with different identities have no
>>>>> inheritance or compatibility relationship. Pipeline snapshot/update is an
>>>>> edge case, but changing coder is not supported by any runner I know of, and
>>>>> probably won't be until we have some rather large new ideas.
>>>>> >>>>>
>>>>> >>>>> Kenn
>>>>> >>>>>
>>>>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <
>>>>> coheigea@apache.org> wrote:
>>>>> >>>>>>
>>>>> >>>>>> Hi,
>>>>> >>>>>>
>>>>> >>>>>> I have a question on SerializableCoder. I'm looking at
>>>>> hardening the Java Object deserialization that is taking place. We have a
>>>>> "Class<T> type" that is used to decode the input stream:
>>>>> >>>>>>
>>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>>>>> >>>>>> return type.cast(ois.readObject());
>>>>> >>>>>>
>>>>> >>>>>> What I would like to do would be something like:
>>>>> >>>>>>
>>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>>>> >>>>>>     @Override
>>>>> >>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc)
>>>>> throws IOException, ClassNotFoundException {
>>>>> >>>>>>         if (!desc.getName().equals(type.getName())) {
>>>>> >>>>>>             throw new InvalidClassException("Unauthorized
>>>>> deserialization attempt", desc.getName());
>>>>> >>>>>>         }
>>>>> >>>>>>         return super.resolveClass(desc);
>>>>> >>>>>>     }
>>>>> >>>>>> };
>>>>> >>>>>> return type.cast(ois.readObject());
>>>>> >>>>>>
>>>>> >>>>>> This would prevent a possible security hole where an attacker
>>>>> could try to force the recipient of the input stream to deserialize to a
>>>>> gadget class or the like for a RCE.
>>>>> >>>>>>
>>>>> >>>>>> The question is - does the deserialized type have to correspond
>>>>> exactly to the supplied Class? Or is it supported that it's a base type /
>>>>> abstract class? If the latter then my idea won't really work. But if the
>>>>> type corresponds exactly then it should work OK.
>>>>> >>>>>>
>>>>> >>>>>> Thanks,
>>>>> >>>>>>
>>>>> >>>>>> Colm.
>>>>>
>>>>

Re: Question on SerializableCoder

Posted by Luke Cwik <lc...@google.com>.

For the system property approach, the underlying execution engine
(Flink/Spark/...) need to use a separate JVM for each job instance and not
run multiple jobs within one JVM otherwise the users JVM initializer may
step on someone else's JVM initializer logic. The execution engine also
needs to expose some way to run hooks before job start that Beam's
JvmInitializers would need to be translated to.

Assuming all the above is supported by an execution engine then yes. We
could tell users to make sure that the system property jdk.serialFilter is
configured correctly for the users cluster (however that may be done per
runner) otherwise we would want a pipeline level wide ObjectInputFilter
option and/or per SerializableCoder instance ObjectInputFilter.

On Thu, Mar 26, 2020 at 8:11 AM Ismaël Mejía <ie...@gmail.com> wrote:

> Luke, can a similar approach be used in classic runners? What would we
> need to change to achieve it?
>
> On Thu, Mar 26, 2020 at 4:06 PM Luke Cwik <lc...@google.com> wrote:
> >
> > From the private@ thread, I suggested:
> > "With the JvmInitializer[1] being supported by Dataflow and the portable
> Java container, users would be able to write code which sets the system
> property jdk.serialFilter or by configuring
> ObjectInputFilter.Config.setSerialFilter(filter)[2]"
> >
> > This could become a documentation change to SerializableCoder.
> >
> > 1:
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java
> > 2:
> https://docs.oracle.com/javase/10/core/serialization-filtering1.htm#JSCOR-GUID-952E2328-AB66-4412-8B6B-3BCCB3195C25
> >
> > On Thu, Mar 26, 2020 at 4:51 AM Colm O hEigeartaigh <co...@apache.org>
> wrote:
> >>
> >> Thanks for all the feedback. Two possible ideas that occur to me:
> >>
> >>  - Create TypeSafeSerializableCoder or somesuch which extends
> SerializableCoder and enforces the check as in the PR. Update the
> documentation to suggest using the new coder if you don't need to support
> collections as the raw type or subclasses.
> >>  - Add a new "of" method to SerializableCoder which takes a boolean
> parameter to control whether we perform this check or not (defaults to
> true?).
> >>
> >> Colm.
> >>
> >> On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <lc...@google.com> wrote:
> >>>
> >>> I have seen people ingest data using SerializableCoder.
> >>>
> >>> On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <ke...@apache.org>
> wrote:
> >>>>
> >>>> I won't bring other people's words from private@, but can share
> mine. I don't believe it exposes anything new.
> >>>>
> >>>> > If it is SerializableCoder - attacker controls the other end of
> e.g. Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could
> have an allowlist or try to automatically construct an allowlist] and
> otherwise there is no vulnerability for internal coders.
> >>>>
> >>>> I have never seen or heard of a user doing dynamic deserialization
> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is
> important to someone then they would need a more secure solution than
> SerializableCoder.
> >>>>
> >>>> Side note: it would be great to provide an efficient and usable
> solution for the problem of wanting to dynamically dispatch serde in the
> middle of a pipeline. It is actually independent from being able to provide
> coders for a wide variety of types, which we can do a bunch of different
> (mostly better) ways. (has a better solution been built since the last time
> I thought about this?)
> >>>>
> >>>> Kenn
> >>>>
> >>>> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <ie...@gmail.com>
> wrote:
> >>>>>
> >>>>> The link to the previous covnersation (discussion happened in
> private@
> >>>>> and I suppose we can bring some relevant bits here if needed)
> >>>>>
> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E
> >>>>>
> >>>>> I remember Robert had some points there, but I am not sure we
> >>>>> found/agreed on a solution that was relevant and did not break
> current
> >>>>> users and their user experience (like the case of blacklists).
> >>>>>
> >>>>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <lc...@google.com> wrote:
> >>>>> >
> >>>>> > Being able to have something that can encode any object (or at
> least a large class of objects) is extremely powerful so requiring
> SerializableCoder<T> to only encode T.class would hurt our users.
> >>>>> >
> >>>>> > I believe someone looked at this kind of problem before and we
> came to an agreement of usng an explicit approve/deny list on the class
> names which would address the security concern. I don't remember the thread
> though and couldn't find the thread after a few minutes of searching.
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <ke...@apache.org>
> wrote:
> >>>>> >>
> >>>>> >> So you think the spec for SerializableCoder<T> (currently doesn't
> really have one) should be that it dynamically dispatches what it
> deserializes? I had imagined we would treat it more as a statically
> determined coder, so because it is invariant in T we would not allow up or
> down casts (they are unsafe). But we probably don't actually have the
> static information to do that anyhow so you are probably right.
> >>>>> >>
> >>>>> >> I wonder about the threat model here. Is this the event that the
> runner (managed service or bespoke cluster) is compromised and is
> attempting RCE on the Java SDK harness or runner-specific Java-based worker?
> >>>>> >>
> >>>>> >> Kenn
> >>>>> >>
> >>>>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com>
> wrote:
> >>>>> >>>
> >>>>> >>> I don't think this is going to work since SerializableCoder<T>
> should be able to decode T and all objects that implement/extend T. I'm
> pretty sure SerializableCoder<Set/List/...> is common enough while the
> concrete type is HashSet/ArrayList/...
> >>>>> >>> I'm pretty sure there is some way you could come up with some
> way for making this optin though.
> >>>>> >>>
> >>>>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <
> coheigea@apache.org> wrote:
> >>>>> >>>>
> >>>>> >>>> Thanks Kenn. I submitted a PR here:
> https://github.com/apache/beam/pull/11191
> >>>>> >>>>
> >>>>> >>>> Colm.
> >>>>> >>>>
> >>>>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <
> kenn@apache.org> wrote:
> >>>>> >>>>>
> >>>>> >>>>> I think this is fine. The same coder is used for encode and
> decode, so the Class object should be the same as well. Inheritance is not
> part of the Beam model (thank goodness) so this is a language-specific
> concern. As far as the model is concerned, the full URN and the payload of
> the coder is its identity and coders with different identities have no
> inheritance or compatibility relationship. Pipeline snapshot/update is an
> edge case, but changing coder is not supported by any runner I know of, and
> probably won't be until we have some rather large new ideas.
> >>>>> >>>>>
> >>>>> >>>>> Kenn
> >>>>> >>>>>
> >>>>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <
> coheigea@apache.org> wrote:
> >>>>> >>>>>>
> >>>>> >>>>>> Hi,
> >>>>> >>>>>>
> >>>>> >>>>>> I have a question on SerializableCoder. I'm looking at
> hardening the Java Object deserialization that is taking place. We have a
> "Class<T> type" that is used to decode the input stream:
> >>>>> >>>>>>
> >>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
> >>>>> >>>>>> return type.cast(ois.readObject());
> >>>>> >>>>>>
> >>>>> >>>>>> What I would like to do would be something like:
> >>>>> >>>>>>
> >>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
> >>>>> >>>>>>     @Override
> >>>>> >>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc)
> throws IOException, ClassNotFoundException {
> >>>>> >>>>>>         if (!desc.getName().equals(type.getName())) {
> >>>>> >>>>>>             throw new InvalidClassException("Unauthorized
> deserialization attempt", desc.getName());
> >>>>> >>>>>>         }
> >>>>> >>>>>>         return super.resolveClass(desc);
> >>>>> >>>>>>     }
> >>>>> >>>>>> };
> >>>>> >>>>>> return type.cast(ois.readObject());
> >>>>> >>>>>>
> >>>>> >>>>>> This would prevent a possible security hole where an attacker
> could try to force the recipient of the input stream to deserialize to a
> gadget class or the like for a RCE.
> >>>>> >>>>>>
> >>>>> >>>>>> The question is - does the deserialized type have to
> correspond exactly to the supplied Class? Or is it supported that it's a
> base type / abstract class? If the latter then my idea won't really work.
> But if the type corresponds exactly then it should work OK.
> >>>>> >>>>>>
> >>>>> >>>>>> Thanks,
> >>>>> >>>>>>
> >>>>> >>>>>> Colm.
>

Re: Question on SerializableCoder

Posted by Ismaël Mejía <ie...@gmail.com>.

Luke, can a similar approach be used in classic runners? What would we
need to change to achieve it?

On Thu, Mar 26, 2020 at 4:06 PM Luke Cwik <lc...@google.com> wrote:
>
> From the private@ thread, I suggested:
> "With the JvmInitializer[1] being supported by Dataflow and the portable Java container, users would be able to write code which sets the system property jdk.serialFilter or by configuring ObjectInputFilter.Config.setSerialFilter(filter)[2]"
>
> This could become a documentation change to SerializableCoder.
>
> 1: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java
> 2: https://docs.oracle.com/javase/10/core/serialization-filtering1.htm#JSCOR-GUID-952E2328-AB66-4412-8B6B-3BCCB3195C25
>
> On Thu, Mar 26, 2020 at 4:51 AM Colm O hEigeartaigh <co...@apache.org> wrote:
>>
>> Thanks for all the feedback. Two possible ideas that occur to me:
>>
>>  - Create TypeSafeSerializableCoder or somesuch which extends SerializableCoder and enforces the check as in the PR. Update the documentation to suggest using the new coder if you don't need to support collections as the raw type or subclasses.
>>  - Add a new "of" method to SerializableCoder which takes a boolean parameter to control whether we perform this check or not (defaults to true?).
>>
>> Colm.
>>
>> On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>> I have seen people ingest data using SerializableCoder.
>>>
>>> On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>>
>>>> I won't bring other people's words from private@, but can share mine. I don't believe it exposes anything new.
>>>>
>>>> > If it is SerializableCoder - attacker controls the other end of e.g. Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have an allowlist or try to automatically construct an allowlist] and otherwise there is no vulnerability for internal coders.
>>>>
>>>> I have never seen or heard of a user doing dynamic deserialization dispatch on ingestion, but that doesn't mean it doesn't happen. If it is important to someone then they would need a more secure solution than SerializableCoder.
>>>>
>>>> Side note: it would be great to provide an efficient and usable solution for the problem of wanting to dynamically dispatch serde in the middle of a pipeline. It is actually independent from being able to provide coders for a wide variety of types, which we can do a bunch of different (mostly better) ways. (has a better solution been built since the last time I thought about this?)
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>>
>>>>> The link to the previous covnersation (discussion happened in private@
>>>>> and I suppose we can bring some relevant bits here if needed)
>>>>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E
>>>>>
>>>>> I remember Robert had some points there, but I am not sure we
>>>>> found/agreed on a solution that was relevant and did not break current
>>>>> users and their user experience (like the case of blacklists).
>>>>>
>>>>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <lc...@google.com> wrote:
>>>>> >
>>>>> > Being able to have something that can encode any object (or at least a large class of objects) is extremely powerful so requiring SerializableCoder<T> to only encode T.class would hurt our users.
>>>>> >
>>>>> > I believe someone looked at this kind of problem before and we came to an agreement of usng an explicit approve/deny list on the class names which would address the security concern. I don't remember the thread though and couldn't find the thread after a few minutes of searching.
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>>> >>
>>>>> >> So you think the spec for SerializableCoder<T> (currently doesn't really have one) should be that it dynamically dispatches what it deserializes? I had imagined we would treat it more as a statically determined coder, so because it is invariant in T we would not allow up or down casts (they are unsafe). But we probably don't actually have the static information to do that anyhow so you are probably right.
>>>>> >>
>>>>> >> I wonder about the threat model here. Is this the event that the runner (managed service or bespoke cluster) is compromised and is attempting RCE on the Java SDK harness or runner-specific Java-based worker?
>>>>> >>
>>>>> >> Kenn
>>>>> >>
>>>>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com> wrote:
>>>>> >>>
>>>>> >>> I don't think this is going to work since SerializableCoder<T> should be able to decode T and all objects that implement/extend T. I'm pretty sure SerializableCoder<Set/List/...> is common enough while the concrete type is HashSet/ArrayList/...
>>>>> >>> I'm pretty sure there is some way you could come up with some way for making this optin though.
>>>>> >>>
>>>>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <co...@apache.org> wrote:
>>>>> >>>>
>>>>> >>>> Thanks Kenn. I submitted a PR here: https://github.com/apache/beam/pull/11191
>>>>> >>>>
>>>>> >>>> Colm.
>>>>> >>>>
>>>>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>>> >>>>>
>>>>> >>>>> I think this is fine. The same coder is used for encode and decode, so the Class object should be the same as well. Inheritance is not part of the Beam model (thank goodness) so this is a language-specific concern. As far as the model is concerned, the full URN and the payload of the coder is its identity and coders with different identities have no inheritance or compatibility relationship. Pipeline snapshot/update is an edge case, but changing coder is not supported by any runner I know of, and probably won't be until we have some rather large new ideas.
>>>>> >>>>>
>>>>> >>>>> Kenn
>>>>> >>>>>
>>>>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <co...@apache.org> wrote:
>>>>> >>>>>>
>>>>> >>>>>> Hi,
>>>>> >>>>>>
>>>>> >>>>>> I have a question on SerializableCoder. I'm looking at hardening the Java Object deserialization that is taking place. We have a "Class<T> type" that is used to decode the input stream:
>>>>> >>>>>>
>>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>>>>> >>>>>> return type.cast(ois.readObject());
>>>>> >>>>>>
>>>>> >>>>>> What I would like to do would be something like:
>>>>> >>>>>>
>>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>>>> >>>>>>     @Override
>>>>> >>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc) throws IOException, ClassNotFoundException {
>>>>> >>>>>>         if (!desc.getName().equals(type.getName())) {
>>>>> >>>>>>             throw new InvalidClassException("Unauthorized deserialization attempt", desc.getName());
>>>>> >>>>>>         }
>>>>> >>>>>>         return super.resolveClass(desc);
>>>>> >>>>>>     }
>>>>> >>>>>> };
>>>>> >>>>>> return type.cast(ois.readObject());
>>>>> >>>>>>
>>>>> >>>>>> This would prevent a possible security hole where an attacker could try to force the recipient of the input stream to deserialize to a gadget class or the like for a RCE.
>>>>> >>>>>>
>>>>> >>>>>> The question is - does the deserialized type have to correspond exactly to the supplied Class? Or is it supported that it's a base type / abstract class? If the latter then my idea won't really work. But if the type corresponds exactly then it should work OK.
>>>>> >>>>>>
>>>>> >>>>>> Thanks,
>>>>> >>>>>>
>>>>> >>>>>> Colm.

Re: Question on SerializableCoder

Posted by Luke Cwik <lc...@google.com>.

From the private@ thread, I suggested:
"With the JvmInitializer[1] being supported by Dataflow and the portable
Java container, users would be able to write code which sets the system
property jdk.serialFilter or by configuring
ObjectInputFilter.Config.setSerialFilter(filter)[2]"

This could become a documentation change to SerializableCoder.

1:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java
2:
https://docs.oracle.com/javase/10/core/serialization-filtering1.htm#JSCOR-GUID-952E2328-AB66-4412-8B6B-3BCCB3195C25

On Thu, Mar 26, 2020 at 4:51 AM Colm O hEigeartaigh <co...@apache.org>
wrote:

> Thanks for all the feedback. Two possible ideas that occur to me:
>
>  - Create TypeSafeSerializableCoder or somesuch which extends
> SerializableCoder and enforces the check as in the PR. Update the
> documentation to suggest using the new coder if you don't need to support
> collections as the raw type or subclasses.
>  - Add a new "of" method to SerializableCoder which takes a boolean
> parameter to control whether we perform this check or not (defaults to
> true?).
>
> Colm.
>
> On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <lc...@google.com> wrote:
>
>> I have seen people ingest data using SerializableCoder.
>>
>> On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>> I won't bring other people's words from private@, but can share mine. I
>>> don't believe it exposes anything new.
>>>
>>> > If it is SerializableCoder - attacker controls the other end of e.g.
>>> Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have
>>> an allowlist or try to automatically construct an allowlist] and otherwise
>>> there is no vulnerability for internal coders.
>>>
>>> I have never seen or heard of a user doing dynamic deserialization
>>> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is
>>> important to someone then they would need a more secure solution than
>>> SerializableCoder.
>>>
>>> Side note: it would be great to provide an efficient and usable solution
>>> for the problem of wanting to dynamically dispatch serde in the middle of a
>>> pipeline. It is actually independent from being able to provide coders for
>>> a wide variety of types, which we can do a bunch of different (mostly
>>> better) ways. (has a better solution been built since the last time I
>>> thought about this?)
>>>
>>> Kenn
>>>
>>> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <ie...@gmail.com> wrote:
>>>
>>>> The link to the previous covnersation (discussion happened in private@
>>>> and I suppose we can bring some relevant bits here if needed)
>>>>
>>>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E
>>>>
>>>> I remember Robert had some points there, but I am not sure we
>>>> found/agreed on a solution that was relevant and did not break current
>>>> users and their user experience (like the case of blacklists).
>>>>
>>>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <lc...@google.com> wrote:
>>>> >
>>>> > Being able to have something that can encode any object (or at least
>>>> a large class of objects) is extremely powerful so requiring
>>>> SerializableCoder<T> to only encode T.class would hurt our users.
>>>> >
>>>> > I believe someone looked at this kind of problem before and we came
>>>> to an agreement of usng an explicit approve/deny list on the class names
>>>> which would address the security concern. I don't remember the thread
>>>> though and couldn't find the thread after a few minutes of searching.
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <ke...@apache.org>
>>>> wrote:
>>>> >>
>>>> >> So you think the spec for SerializableCoder<T> (currently doesn't
>>>> really have one) should be that it dynamically dispatches what it
>>>> deserializes? I had imagined we would treat it more as a statically
>>>> determined coder, so because it is invariant in T we would not allow up or
>>>> down casts (they are unsafe). But we probably don't actually have the
>>>> static information to do that anyhow so you are probably right.
>>>> >>
>>>> >> I wonder about the threat model here. Is this the event that the
>>>> runner (managed service or bespoke cluster) is compromised and is
>>>> attempting RCE on the Java SDK harness or runner-specific Java-based worker?
>>>> >>
>>>> >> Kenn
>>>> >>
>>>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com> wrote:
>>>> >>>
>>>> >>> I don't think this is going to work since SerializableCoder<T>
>>>> should be able to decode T and all objects that implement/extend T. I'm
>>>> pretty sure SerializableCoder<Set/List/...> is common enough while the
>>>> concrete type is HashSet/ArrayList/...
>>>> >>> I'm pretty sure there is some way you could come up with some way
>>>> for making this optin though.
>>>> >>>
>>>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <
>>>> coheigea@apache.org> wrote:
>>>> >>>>
>>>> >>>> Thanks Kenn. I submitted a PR here:
>>>> https://github.com/apache/beam/pull/11191
>>>> >>>>
>>>> >>>> Colm.
>>>> >>>>
>>>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org>
>>>> wrote:
>>>> >>>>>
>>>> >>>>> I think this is fine. The same coder is used for encode and
>>>> decode, so the Class object should be the same as well. Inheritance is not
>>>> part of the Beam model (thank goodness) so this is a language-specific
>>>> concern. As far as the model is concerned, the full URN and the payload of
>>>> the coder is its identity and coders with different identities have no
>>>> inheritance or compatibility relationship. Pipeline snapshot/update is an
>>>> edge case, but changing coder is not supported by any runner I know of, and
>>>> probably won't be until we have some rather large new ideas.
>>>> >>>>>
>>>> >>>>> Kenn
>>>> >>>>>
>>>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <
>>>> coheigea@apache.org> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi,
>>>> >>>>>>
>>>> >>>>>> I have a question on SerializableCoder. I'm looking at hardening
>>>> the Java Object deserialization that is taking place. We have a "Class<T>
>>>> type" that is used to decode the input stream:
>>>> >>>>>>
>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>>>> >>>>>> return type.cast(ois.readObject());
>>>> >>>>>>
>>>> >>>>>> What I would like to do would be something like:
>>>> >>>>>>
>>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>>> >>>>>>     @Override
>>>> >>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc)
>>>> throws IOException, ClassNotFoundException {
>>>> >>>>>>         if (!desc.getName().equals(type.getName())) {
>>>> >>>>>>             throw new InvalidClassException("Unauthorized
>>>> deserialization attempt", desc.getName());
>>>> >>>>>>         }
>>>> >>>>>>         return super.resolveClass(desc);
>>>> >>>>>>     }
>>>> >>>>>> };
>>>> >>>>>> return type.cast(ois.readObject());
>>>> >>>>>>
>>>> >>>>>> This would prevent a possible security hole where an attacker
>>>> could try to force the recipient of the input stream to deserialize to a
>>>> gadget class or the like for a RCE.
>>>> >>>>>>
>>>> >>>>>> The question is - does the deserialized type have to correspond
>>>> exactly to the supplied Class? Or is it supported that it's a base type /
>>>> abstract class? If the latter then my idea won't really work. But if the
>>>> type corresponds exactly then it should work OK.
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>>
>>>> >>>>>> Colm.
>>>>
>>>

Re: Question on SerializableCoder

Posted by Colm O hEigeartaigh <co...@apache.org>.

Thanks for all the feedback. Two possible ideas that occur to me:

 - Create TypeSafeSerializableCoder or somesuch which extends
SerializableCoder and enforces the check as in the PR. Update the
documentation to suggest using the new coder if you don't need to support
collections as the raw type or subclasses.
 - Add a new "of" method to SerializableCoder which takes a boolean
parameter to control whether we perform this check or not (defaults to
true?).

Colm.

On Tue, Mar 24, 2020 at 2:11 AM Luke Cwik <lc...@google.com> wrote:

> I have seen people ingest data using SerializableCoder.
>
> On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> I won't bring other people's words from private@, but can share mine. I
>> don't believe it exposes anything new.
>>
>> > If it is SerializableCoder - attacker controls the other end of e.g.
>> Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have
>> an allowlist or try to automatically construct an allowlist] and otherwise
>> there is no vulnerability for internal coders.
>>
>> I have never seen or heard of a user doing dynamic deserialization
>> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is
>> important to someone then they would need a more secure solution than
>> SerializableCoder.
>>
>> Side note: it would be great to provide an efficient and usable solution
>> for the problem of wanting to dynamically dispatch serde in the middle of a
>> pipeline. It is actually independent from being able to provide coders for
>> a wide variety of types, which we can do a bunch of different (mostly
>> better) ways. (has a better solution been built since the last time I
>> thought about this?)
>>
>> Kenn
>>
>> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <ie...@gmail.com> wrote:
>>
>>> The link to the previous covnersation (discussion happened in private@
>>> and I suppose we can bring some relevant bits here if needed)
>>>
>>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E
>>>
>>> I remember Robert had some points there, but I am not sure we
>>> found/agreed on a solution that was relevant and did not break current
>>> users and their user experience (like the case of blacklists).
>>>
>>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <lc...@google.com> wrote:
>>> >
>>> > Being able to have something that can encode any object (or at least a
>>> large class of objects) is extremely powerful so requiring
>>> SerializableCoder<T> to only encode T.class would hurt our users.
>>> >
>>> > I believe someone looked at this kind of problem before and we came to
>>> an agreement of usng an explicit approve/deny list on the class names which
>>> would address the security concern. I don't remember the thread though and
>>> couldn't find the thread after a few minutes of searching.
>>> >
>>> >
>>> >
>>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <ke...@apache.org>
>>> wrote:
>>> >>
>>> >> So you think the spec for SerializableCoder<T> (currently doesn't
>>> really have one) should be that it dynamically dispatches what it
>>> deserializes? I had imagined we would treat it more as a statically
>>> determined coder, so because it is invariant in T we would not allow up or
>>> down casts (they are unsafe). But we probably don't actually have the
>>> static information to do that anyhow so you are probably right.
>>> >>
>>> >> I wonder about the threat model here. Is this the event that the
>>> runner (managed service or bespoke cluster) is compromised and is
>>> attempting RCE on the Java SDK harness or runner-specific Java-based worker?
>>> >>
>>> >> Kenn
>>> >>
>>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com> wrote:
>>> >>>
>>> >>> I don't think this is going to work since SerializableCoder<T>
>>> should be able to decode T and all objects that implement/extend T. I'm
>>> pretty sure SerializableCoder<Set/List/...> is common enough while the
>>> concrete type is HashSet/ArrayList/...
>>> >>> I'm pretty sure there is some way you could come up with some way
>>> for making this optin though.
>>> >>>
>>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <
>>> coheigea@apache.org> wrote:
>>> >>>>
>>> >>>> Thanks Kenn. I submitted a PR here:
>>> https://github.com/apache/beam/pull/11191
>>> >>>>
>>> >>>> Colm.
>>> >>>>
>>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org>
>>> wrote:
>>> >>>>>
>>> >>>>> I think this is fine. The same coder is used for encode and
>>> decode, so the Class object should be the same as well. Inheritance is not
>>> part of the Beam model (thank goodness) so this is a language-specific
>>> concern. As far as the model is concerned, the full URN and the payload of
>>> the coder is its identity and coders with different identities have no
>>> inheritance or compatibility relationship. Pipeline snapshot/update is an
>>> edge case, but changing coder is not supported by any runner I know of, and
>>> probably won't be until we have some rather large new ideas.
>>> >>>>>
>>> >>>>> Kenn
>>> >>>>>
>>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <
>>> coheigea@apache.org> wrote:
>>> >>>>>>
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> I have a question on SerializableCoder. I'm looking at hardening
>>> the Java Object deserialization that is taking place. We have a "Class<T>
>>> type" that is used to decode the input stream:
>>> >>>>>>
>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>>> >>>>>> return type.cast(ois.readObject());
>>> >>>>>>
>>> >>>>>> What I would like to do would be something like:
>>> >>>>>>
>>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>> >>>>>>     @Override
>>> >>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc)
>>> throws IOException, ClassNotFoundException {
>>> >>>>>>         if (!desc.getName().equals(type.getName())) {
>>> >>>>>>             throw new InvalidClassException("Unauthorized
>>> deserialization attempt", desc.getName());
>>> >>>>>>         }
>>> >>>>>>         return super.resolveClass(desc);
>>> >>>>>>     }
>>> >>>>>> };
>>> >>>>>> return type.cast(ois.readObject());
>>> >>>>>>
>>> >>>>>> This would prevent a possible security hole where an attacker
>>> could try to force the recipient of the input stream to deserialize to a
>>> gadget class or the like for a RCE.
>>> >>>>>>
>>> >>>>>> The question is - does the deserialized type have to correspond
>>> exactly to the supplied Class? Or is it supported that it's a base type /
>>> abstract class? If the latter then my idea won't really work. But if the
>>> type corresponds exactly then it should work OK.
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>>
>>> >>>>>> Colm.
>>>
>>

Re: Question on SerializableCoder

Posted by Luke Cwik <lc...@google.com>.

I have seen people ingest data using SerializableCoder.

On Mon, Mar 23, 2020 at 2:51 PM Kenneth Knowles <ke...@apache.org> wrote:

> I won't bring other people's words from private@, but can share mine. I
> don't believe it exposes anything new.
>
> > If it is SerializableCoder - attacker controls the other end of e.g.
> Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have
> an allowlist or try to automatically construct an allowlist] and otherwise
> there is no vulnerability for internal coders.
>
> I have never seen or heard of a user doing dynamic deserialization
> dispatch on ingestion, but that doesn't mean it doesn't happen. If it is
> important to someone then they would need a more secure solution than
> SerializableCoder.
>
> Side note: it would be great to provide an efficient and usable solution
> for the problem of wanting to dynamically dispatch serde in the middle of a
> pipeline. It is actually independent from being able to provide coders for
> a wide variety of types, which we can do a bunch of different (mostly
> better) ways. (has a better solution been built since the last time I
> thought about this?)
>
> Kenn
>
> On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> The link to the previous covnersation (discussion happened in private@
>> and I suppose we can bring some relevant bits here if needed)
>>
>> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E
>>
>> I remember Robert had some points there, but I am not sure we
>> found/agreed on a solution that was relevant and did not break current
>> users and their user experience (like the case of blacklists).
>>
>> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <lc...@google.com> wrote:
>> >
>> > Being able to have something that can encode any object (or at least a
>> large class of objects) is extremely powerful so requiring
>> SerializableCoder<T> to only encode T.class would hurt our users.
>> >
>> > I believe someone looked at this kind of problem before and we came to
>> an agreement of usng an explicit approve/deny list on the class names which
>> would address the security concern. I don't remember the thread though and
>> couldn't find the thread after a few minutes of searching.
>> >
>> >
>> >
>> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <ke...@apache.org>
>> wrote:
>> >>
>> >> So you think the spec for SerializableCoder<T> (currently doesn't
>> really have one) should be that it dynamically dispatches what it
>> deserializes? I had imagined we would treat it more as a statically
>> determined coder, so because it is invariant in T we would not allow up or
>> down casts (they are unsafe). But we probably don't actually have the
>> static information to do that anyhow so you are probably right.
>> >>
>> >> I wonder about the threat model here. Is this the event that the
>> runner (managed service or bespoke cluster) is compromised and is
>> attempting RCE on the Java SDK harness or runner-specific Java-based worker?
>> >>
>> >> Kenn
>> >>
>> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com> wrote:
>> >>>
>> >>> I don't think this is going to work since SerializableCoder<T> should
>> be able to decode T and all objects that implement/extend T. I'm pretty
>> sure SerializableCoder<Set/List/...> is common enough while the concrete
>> type is HashSet/ArrayList/...
>> >>> I'm pretty sure there is some way you could come up with some way for
>> making this optin though.
>> >>>
>> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <
>> coheigea@apache.org> wrote:
>> >>>>
>> >>>> Thanks Kenn. I submitted a PR here:
>> https://github.com/apache/beam/pull/11191
>> >>>>
>> >>>> Colm.
>> >>>>
>> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org>
>> wrote:
>> >>>>>
>> >>>>> I think this is fine. The same coder is used for encode and decode,
>> so the Class object should be the same as well. Inheritance is not part of
>> the Beam model (thank goodness) so this is a language-specific concern. As
>> far as the model is concerned, the full URN and the payload of the coder is
>> its identity and coders with different identities have no inheritance or
>> compatibility relationship. Pipeline snapshot/update is an edge case, but
>> changing coder is not supported by any runner I know of, and probably won't
>> be until we have some rather large new ideas.
>> >>>>>
>> >>>>> Kenn
>> >>>>>
>> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <
>> coheigea@apache.org> wrote:
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I have a question on SerializableCoder. I'm looking at hardening
>> the Java Object deserialization that is taking place. We have a "Class<T>
>> type" that is used to decode the input stream:
>> >>>>>>
>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>> >>>>>> return type.cast(ois.readObject());
>> >>>>>>
>> >>>>>> What I would like to do would be something like:
>> >>>>>>
>> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>> >>>>>>     @Override
>> >>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc) throws
>> IOException, ClassNotFoundException {
>> >>>>>>         if (!desc.getName().equals(type.getName())) {
>> >>>>>>             throw new InvalidClassException("Unauthorized
>> deserialization attempt", desc.getName());
>> >>>>>>         }
>> >>>>>>         return super.resolveClass(desc);
>> >>>>>>     }
>> >>>>>> };
>> >>>>>> return type.cast(ois.readObject());
>> >>>>>>
>> >>>>>> This would prevent a possible security hole where an attacker
>> could try to force the recipient of the input stream to deserialize to a
>> gadget class or the like for a RCE.
>> >>>>>>
>> >>>>>> The question is - does the deserialized type have to correspond
>> exactly to the supplied Class? Or is it supported that it's a base type /
>> abstract class? If the latter then my idea won't really work. But if the
>> type corresponds exactly then it should work OK.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>>
>> >>>>>> Colm.
>>
>

Re: Question on SerializableCoder

Posted by Kenneth Knowles <ke...@apache.org>.

I won't bring other people's words from private@, but can share mine. I
don't believe it exposes anything new.

> If it is SerializableCoder - attacker controls the other end of e.g.
Kafka or Pubsub that is decoding w/ ObjectInputStream - [then we could have
an allowlist or try to automatically construct an allowlist] and otherwise
there is no vulnerability for internal coders.

I have never seen or heard of a user doing dynamic deserialization dispatch
on ingestion, but that doesn't mean it doesn't happen. If it is important
to someone then they would need a more secure solution than
SerializableCoder.

Side note: it would be great to provide an efficient and usable solution
for the problem of wanting to dynamically dispatch serde in the middle of a
pipeline. It is actually independent from being able to provide coders for
a wide variety of types, which we can do a bunch of different (mostly
better) ways. (has a better solution been built since the last time I
thought about this?)

Kenn

On Mon, Mar 23, 2020 at 1:36 PM Ismaël Mejía <ie...@gmail.com> wrote:

> The link to the previous covnersation (discussion happened in private@
> and I suppose we can bring some relevant bits here if needed)
>
> https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E
>
> I remember Robert had some points there, but I am not sure we
> found/agreed on a solution that was relevant and did not break current
> users and their user experience (like the case of blacklists).
>
> On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <lc...@google.com> wrote:
> >
> > Being able to have something that can encode any object (or at least a
> large class of objects) is extremely powerful so requiring
> SerializableCoder<T> to only encode T.class would hurt our users.
> >
> > I believe someone looked at this kind of problem before and we came to
> an agreement of usng an explicit approve/deny list on the class names which
> would address the security concern. I don't remember the thread though and
> couldn't find the thread after a few minutes of searching.
> >
> >
> >
> > On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <ke...@apache.org> wrote:
> >>
> >> So you think the spec for SerializableCoder<T> (currently doesn't
> really have one) should be that it dynamically dispatches what it
> deserializes? I had imagined we would treat it more as a statically
> determined coder, so because it is invariant in T we would not allow up or
> down casts (they are unsafe). But we probably don't actually have the
> static information to do that anyhow so you are probably right.
> >>
> >> I wonder about the threat model here. Is this the event that the runner
> (managed service or bespoke cluster) is compromised and is attempting RCE
> on the Java SDK harness or runner-specific Java-based worker?
> >>
> >> Kenn
> >>
> >> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com> wrote:
> >>>
> >>> I don't think this is going to work since SerializableCoder<T> should
> be able to decode T and all objects that implement/extend T. I'm pretty
> sure SerializableCoder<Set/List/...> is common enough while the concrete
> type is HashSet/ArrayList/...
> >>> I'm pretty sure there is some way you could come up with some way for
> making this optin though.
> >>>
> >>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <
> coheigea@apache.org> wrote:
> >>>>
> >>>> Thanks Kenn. I submitted a PR here:
> https://github.com/apache/beam/pull/11191
> >>>>
> >>>> Colm.
> >>>>
> >>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org>
> wrote:
> >>>>>
> >>>>> I think this is fine. The same coder is used for encode and decode,
> so the Class object should be the same as well. Inheritance is not part of
> the Beam model (thank goodness) so this is a language-specific concern. As
> far as the model is concerned, the full URN and the payload of the coder is
> its identity and coders with different identities have no inheritance or
> compatibility relationship. Pipeline snapshot/update is an edge case, but
> changing coder is not supported by any runner I know of, and probably won't
> be until we have some rather large new ideas.
> >>>>>
> >>>>> Kenn
> >>>>>
> >>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <
> coheigea@apache.org> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I have a question on SerializableCoder. I'm looking at hardening
> the Java Object deserialization that is taking place. We have a "Class<T>
> type" that is used to decode the input stream:
> >>>>>>
> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
> >>>>>> return type.cast(ois.readObject());
> >>>>>>
> >>>>>> What I would like to do would be something like:
> >>>>>>
> >>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
> >>>>>>     @Override
> >>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc) throws
> IOException, ClassNotFoundException {
> >>>>>>         if (!desc.getName().equals(type.getName())) {
> >>>>>>             throw new InvalidClassException("Unauthorized
> deserialization attempt", desc.getName());
> >>>>>>         }
> >>>>>>         return super.resolveClass(desc);
> >>>>>>     }
> >>>>>> };
> >>>>>> return type.cast(ois.readObject());
> >>>>>>
> >>>>>> This would prevent a possible security hole where an attacker could
> try to force the recipient of the input stream to deserialize to a gadget
> class or the like for a RCE.
> >>>>>>
> >>>>>> The question is - does the deserialized type have to correspond
> exactly to the supplied Class? Or is it supported that it's a base type /
> abstract class? If the latter then my idea won't really work. But if the
> type corresponds exactly then it should work OK.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Colm.
>

Re: Question on SerializableCoder

Posted by Ismaël Mejía <ie...@gmail.com>.

The link to the previous covnersation (discussion happened in private@
and I suppose we can bring some relevant bits here if needed)
https://lists.apache.org/thread.html/2e1c00999e992e15b08938866bfe7bd3c3d3b3d4d7aa2f8f6eb4600d%40%3Cprivate.beam.apache.org%3E

I remember Robert had some points there, but I am not sure we
found/agreed on a solution that was relevant and did not break current
users and their user experience (like the case of blacklists).

On Mon, Mar 23, 2020 at 9:22 PM Luke Cwik <lc...@google.com> wrote:
>
> Being able to have something that can encode any object (or at least a large class of objects) is extremely powerful so requiring SerializableCoder<T> to only encode T.class would hurt our users.
>
> I believe someone looked at this kind of problem before and we came to an agreement of usng an explicit approve/deny list on the class names which would address the security concern. I don't remember the thread though and couldn't find the thread after a few minutes of searching.
>
>
>
> On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>> So you think the spec for SerializableCoder<T> (currently doesn't really have one) should be that it dynamically dispatches what it deserializes? I had imagined we would treat it more as a statically determined coder, so because it is invariant in T we would not allow up or down casts (they are unsafe). But we probably don't actually have the static information to do that anyhow so you are probably right.
>>
>> I wonder about the threat model here. Is this the event that the runner (managed service or bespoke cluster) is compromised and is attempting RCE on the Java SDK harness or runner-specific Java-based worker?
>>
>> Kenn
>>
>> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>> I don't think this is going to work since SerializableCoder<T> should be able to decode T and all objects that implement/extend T. I'm pretty sure SerializableCoder<Set/List/...> is common enough while the concrete type is HashSet/ArrayList/...
>>> I'm pretty sure there is some way you could come up with some way for making this optin though.
>>>
>>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <co...@apache.org> wrote:
>>>>
>>>> Thanks Kenn. I submitted a PR here: https://github.com/apache/beam/pull/11191
>>>>
>>>> Colm.
>>>>
>>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>
>>>>> I think this is fine. The same coder is used for encode and decode, so the Class object should be the same as well. Inheritance is not part of the Beam model (thank goodness) so this is a language-specific concern. As far as the model is concerned, the full URN and the payload of the coder is its identity and coders with different identities have no inheritance or compatibility relationship. Pipeline snapshot/update is an edge case, but changing coder is not supported by any runner I know of, and probably won't be until we have some rather large new ideas.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <co...@apache.org> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a question on SerializableCoder. I'm looking at hardening the Java Object deserialization that is taking place. We have a "Class<T> type" that is used to decode the input stream:
>>>>>>
>>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>>>>>> return type.cast(ois.readObject());
>>>>>>
>>>>>> What I would like to do would be something like:
>>>>>>
>>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>>>>>     @Override
>>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc) throws IOException, ClassNotFoundException {
>>>>>>         if (!desc.getName().equals(type.getName())) {
>>>>>>             throw new InvalidClassException("Unauthorized deserialization attempt", desc.getName());
>>>>>>         }
>>>>>>         return super.resolveClass(desc);
>>>>>>     }
>>>>>> };
>>>>>> return type.cast(ois.readObject());
>>>>>>
>>>>>> This would prevent a possible security hole where an attacker could try to force the recipient of the input stream to deserialize to a gadget class or the like for a RCE.
>>>>>>
>>>>>> The question is - does the deserialized type have to correspond exactly to the supplied Class? Or is it supported that it's a base type / abstract class? If the latter then my idea won't really work. But if the type corresponds exactly then it should work OK.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Colm.

Re: Question on SerializableCoder

Posted by Luke Cwik <lc...@google.com>.

Being able to have something that can encode any object (or at least a
large class of objects) is extremely powerful so requiring
SerializableCoder<T> to only encode T.class would hurt our users.

I believe someone looked at this kind of problem before and we came to an
agreement of usng an explicit approve/deny list on the class names which
would address the security concern. I don't remember the thread though and
couldn't find the thread after a few minutes of searching.



On Mon, Mar 23, 2020 at 1:07 PM Kenneth Knowles <ke...@apache.org> wrote:

> So you think the spec for SerializableCoder<T> (currently doesn't really
> have one) should be that it dynamically dispatches what it deserializes? I
> had imagined we would treat it more as a statically determined coder, so
> because it is invariant in T we would not allow up or down casts (they are
> unsafe). But we probably don't actually have the static information to do
> that anyhow so you are probably right.
>
> I wonder about the threat model here. Is this the event that the runner
> (managed service or bespoke cluster) is compromised and is attempting RCE
> on the Java SDK harness or runner-specific Java-based worker?
>
> Kenn
>
> On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com> wrote:
>
>> I don't think this is going to work since SerializableCoder<T> should be
>> able to decode T and all objects that implement/extend T. I'm pretty sure
>> SerializableCoder<Set/List/...> is common enough while the concrete type is
>> HashSet/ArrayList/...
>> I'm pretty sure there is some way you could come up with some way for
>> making this optin though.
>>
>> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <co...@apache.org>
>> wrote:
>>
>>> Thanks Kenn. I submitted a PR here:
>>> https://github.com/apache/beam/pull/11191
>>>
>>> Colm.
>>>
>>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>
>>>> I think this is fine. The same coder is used for encode and decode, so
>>>> the Class object should be the same as well. Inheritance is not part of the
>>>> Beam model (thank goodness) so this is a language-specific concern. As far
>>>> as the model is concerned, the full URN and the payload of the coder is its
>>>> identity and coders with different identities have no inheritance or
>>>> compatibility relationship. Pipeline snapshot/update is an edge case, but
>>>> changing coder is not supported by any runner I know of, and probably won't
>>>> be until we have some rather large new ideas.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <
>>>> coheigea@apache.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a question on SerializableCoder. I'm looking at hardening the
>>>>> Java Object deserialization that is taking place. We have a "Class<T> type"
>>>>> that is used to decode the input stream:
>>>>>
>>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>>>>> return type.cast(ois.readObject());
>>>>>
>>>>> What I would like to do would be something like:
>>>>>
>>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>>>>     @Override
>>>>>     protected Class<?> resolveClass(ObjectStreamClass desc) throws
>>>>> IOException, ClassNotFoundException {
>>>>>         if (!desc.getName().equals(type.getName())) {
>>>>>             throw new InvalidClassException("Unauthorized
>>>>> deserialization attempt", desc.getName());
>>>>>         }
>>>>>         return super.resolveClass(desc);
>>>>>     }
>>>>> };
>>>>> return type.cast(ois.readObject());
>>>>>
>>>>> This would prevent a possible security hole where an attacker could
>>>>> try to force the recipient of the input stream to deserialize to a gadget
>>>>> class or the like for a RCE.
>>>>>
>>>>> The question is - does the deserialized type have to correspond
>>>>> exactly to the supplied Class? Or is it supported that it's a base type /
>>>>> abstract class? If the latter then my idea won't really work. But if the
>>>>> type corresponds exactly then it should work OK.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Colm.
>>>>>
>>>>

Re: Question on SerializableCoder

Posted by Kenneth Knowles <ke...@apache.org>.

So you think the spec for SerializableCoder<T> (currently doesn't really
have one) should be that it dynamically dispatches what it deserializes? I
had imagined we would treat it more as a statically determined coder, so
because it is invariant in T we would not allow up or down casts (they are
unsafe). But we probably don't actually have the static information to do
that anyhow so you are probably right.

I wonder about the threat model here. Is this the event that the runner
(managed service or bespoke cluster) is compromised and is attempting RCE
on the Java SDK harness or runner-specific Java-based worker?

Kenn

On Mon, Mar 23, 2020 at 8:09 AM Luke Cwik <lc...@google.com> wrote:

> I don't think this is going to work since SerializableCoder<T> should be
> able to decode T and all objects that implement/extend T. I'm pretty sure
> SerializableCoder<Set/List/...> is common enough while the concrete type is
> HashSet/ArrayList/...
> I'm pretty sure there is some way you could come up with some way for
> making this optin though.
>
> On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <co...@apache.org>
> wrote:
>
>> Thanks Kenn. I submitted a PR here:
>> https://github.com/apache/beam/pull/11191
>>
>> Colm.
>>
>> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>> I think this is fine. The same coder is used for encode and decode, so
>>> the Class object should be the same as well. Inheritance is not part of the
>>> Beam model (thank goodness) so this is a language-specific concern. As far
>>> as the model is concerned, the full URN and the payload of the coder is its
>>> identity and coders with different identities have no inheritance or
>>> compatibility relationship. Pipeline snapshot/update is an edge case, but
>>> changing coder is not supported by any runner I know of, and probably won't
>>> be until we have some rather large new ideas.
>>>
>>> Kenn
>>>
>>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <co...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a question on SerializableCoder. I'm looking at hardening the
>>>> Java Object deserialization that is taking place. We have a "Class<T> type"
>>>> that is used to decode the input stream:
>>>>
>>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>>>> return type.cast(ois.readObject());
>>>>
>>>> What I would like to do would be something like:
>>>>
>>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>>>     @Override
>>>>     protected Class<?> resolveClass(ObjectStreamClass desc) throws
>>>> IOException, ClassNotFoundException {
>>>>         if (!desc.getName().equals(type.getName())) {
>>>>             throw new InvalidClassException("Unauthorized
>>>> deserialization attempt", desc.getName());
>>>>         }
>>>>         return super.resolveClass(desc);
>>>>     }
>>>> };
>>>> return type.cast(ois.readObject());
>>>>
>>>> This would prevent a possible security hole where an attacker could try
>>>> to force the recipient of the input stream to deserialize to a gadget class
>>>> or the like for a RCE.
>>>>
>>>> The question is - does the deserialized type have to correspond exactly
>>>> to the supplied Class? Or is it supported that it's a base type / abstract
>>>> class? If the latter then my idea won't really work. But if the type
>>>> corresponds exactly then it should work OK.
>>>>
>>>> Thanks,
>>>>
>>>> Colm.
>>>>
>>>

Re: Question on SerializableCoder

Posted by Luke Cwik <lc...@google.com>.

I don't think this is going to work since SerializableCoder<T> should be
able to decode T and all objects that implement/extend T. I'm pretty sure
SerializableCoder<Set/List/...> is common enough while the concrete type is
HashSet/ArrayList/...
I'm pretty sure there is some way you could come up with some way for
making this optin though.

On Mon, Mar 23, 2020 at 12:19 AM Colm O hEigeartaigh <co...@apache.org>
wrote:

> Thanks Kenn. I submitted a PR here:
> https://github.com/apache/beam/pull/11191
>
> Colm.
>
> On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> I think this is fine. The same coder is used for encode and decode, so
>> the Class object should be the same as well. Inheritance is not part of the
>> Beam model (thank goodness) so this is a language-specific concern. As far
>> as the model is concerned, the full URN and the payload of the coder is its
>> identity and coders with different identities have no inheritance or
>> compatibility relationship. Pipeline snapshot/update is an edge case, but
>> changing coder is not supported by any runner I know of, and probably won't
>> be until we have some rather large new ideas.
>>
>> Kenn
>>
>> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <co...@apache.org>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a question on SerializableCoder. I'm looking at hardening the
>>> Java Object deserialization that is taking place. We have a "Class<T> type"
>>> that is used to decode the input stream:
>>>
>>> ObjectInputStream ois = new ObjectInputStream(inStream);
>>> return type.cast(ois.readObject());
>>>
>>> What I would like to do would be something like:
>>>
>>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>>     @Override
>>>     protected Class<?> resolveClass(ObjectStreamClass desc) throws
>>> IOException, ClassNotFoundException {
>>>         if (!desc.getName().equals(type.getName())) {
>>>             throw new InvalidClassException("Unauthorized
>>> deserialization attempt", desc.getName());
>>>         }
>>>         return super.resolveClass(desc);
>>>     }
>>> };
>>> return type.cast(ois.readObject());
>>>
>>> This would prevent a possible security hole where an attacker could try
>>> to force the recipient of the input stream to deserialize to a gadget class
>>> or the like for a RCE.
>>>
>>> The question is - does the deserialized type have to correspond exactly
>>> to the supplied Class? Or is it supported that it's a base type / abstract
>>> class? If the latter then my idea won't really work. But if the type
>>> corresponds exactly then it should work OK.
>>>
>>> Thanks,
>>>
>>> Colm.
>>>
>>

Re: Question on SerializableCoder

Posted by Colm O hEigeartaigh <co...@apache.org>.

Thanks Kenn. I submitted a PR here:
https://github.com/apache/beam/pull/11191

Colm.

On Thu, Mar 19, 2020 at 8:13 PM Kenneth Knowles <ke...@apache.org> wrote:

> I think this is fine. The same coder is used for encode and decode, so the
> Class object should be the same as well. Inheritance is not part of the
> Beam model (thank goodness) so this is a language-specific concern. As far
> as the model is concerned, the full URN and the payload of the coder is its
> identity and coders with different identities have no inheritance or
> compatibility relationship. Pipeline snapshot/update is an edge case, but
> changing coder is not supported by any runner I know of, and probably won't
> be until we have some rather large new ideas.
>
> Kenn
>
> On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <co...@apache.org>
> wrote:
>
>> Hi,
>>
>> I have a question on SerializableCoder. I'm looking at hardening the Java
>> Object deserialization that is taking place. We have a "Class<T> type" that
>> is used to decode the input stream:
>>
>> ObjectInputStream ois = new ObjectInputStream(inStream);
>> return type.cast(ois.readObject());
>>
>> What I would like to do would be something like:
>>
>> ObjectInputStream ois = new ObjectInputStream(inStream) {
>>     @Override
>>     protected Class<?> resolveClass(ObjectStreamClass desc) throws
>> IOException, ClassNotFoundException {
>>         if (!desc.getName().equals(type.getName())) {
>>             throw new InvalidClassException("Unauthorized deserialization
>> attempt", desc.getName());
>>         }
>>         return super.resolveClass(desc);
>>     }
>> };
>> return type.cast(ois.readObject());
>>
>> This would prevent a possible security hole where an attacker could try
>> to force the recipient of the input stream to deserialize to a gadget class
>> or the like for a RCE.
>>
>> The question is - does the deserialized type have to correspond exactly
>> to the supplied Class? Or is it supported that it's a base type / abstract
>> class? If the latter then my idea won't really work. But if the type
>> corresponds exactly then it should work OK.
>>
>> Thanks,
>>
>> Colm.
>>
>

Re: Question on SerializableCoder

Posted by Kenneth Knowles <ke...@apache.org>.

I think this is fine. The same coder is used for encode and decode, so the
Class object should be the same as well. Inheritance is not part of the
Beam model (thank goodness) so this is a language-specific concern. As far
as the model is concerned, the full URN and the payload of the coder is its
identity and coders with different identities have no inheritance or
compatibility relationship. Pipeline snapshot/update is an edge case, but
changing coder is not supported by any runner I know of, and probably won't
be until we have some rather large new ideas.

Kenn

On Thu, Mar 19, 2020 at 4:50 AM Colm O hEigeartaigh <co...@apache.org>
wrote:

> Hi,
>
> I have a question on SerializableCoder. I'm looking at hardening the Java
> Object deserialization that is taking place. We have a "Class<T> type" that
> is used to decode the input stream:
>
> ObjectInputStream ois = new ObjectInputStream(inStream);
> return type.cast(ois.readObject());
>
> What I would like to do would be something like:
>
> ObjectInputStream ois = new ObjectInputStream(inStream) {
>     @Override
>     protected Class<?> resolveClass(ObjectStreamClass desc) throws
> IOException, ClassNotFoundException {
>         if (!desc.getName().equals(type.getName())) {
>             throw new InvalidClassException("Unauthorized deserialization
> attempt", desc.getName());
>         }
>         return super.resolveClass(desc);
>     }
> };
> return type.cast(ois.readObject());
>
> This would prevent a possible security hole where an attacker could try to
> force the recipient of the input stream to deserialize to a gadget class or
> the like for a RCE.
>
> The question is - does the deserialized type have to correspond exactly to
> the supplied Class? Or is it supported that it's a base type / abstract
> class? If the latter then my idea won't really work. But if the type
> corresponds exactly then it should work OK.
>
> Thanks,
>
> Colm.
>