You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@iceberg.apache.org by Jack Ye <ye...@gmail.com> on 2021/07/17 09:02:44 UTC

Java Deserialization Vulnerability

Hi everyone,

We use Java serialization and deserialization a lot in Iceberg. I wonder if
we have considered the potential of Java deserialization attack, where an
attacker can replace serialized bytes to execute arbitrary code through the
readObject method.

Currently our SerializationUtil.deserializeFromBytes directly converts
bytes to an ObjectInputStream. I know Apache commons have
ValidatingObjectInputStream which can prevent the issue to some extent.

Have we thought about this issue in the past? Are there any other
suggestions?

Best,
Jack Ye

Re: Java Deserialization Vulnerability

Posted by Steven Wu <st...@gmail.com>.
Yeah, it is a general Java serialization wider than just Iceberg tables.
Typically Flink won't recommend Java serialization for checkpoint state, as
that won't be able to support schema evolution. Flink has built-in support
for schema evolution for Pojo or Avro data types.

On Mon, Jul 19, 2021 at 4:09 PM Ryan Blue <bl...@tabular.io> wrote:

> Yes, I think so. Sounds like an unsecured bucket could lead to code
> execution running with data infrastructure privileges. While it isn't
> exactly Flink's problem, we should probably treat this like a potential
> privilege escalation issue. How does Flink handle this for other cases? I
> think it would make sense to avoid using Java serialization for checkpoint
> state for Iceberg at least, but it is probably a wider problem than just
> Iceberg tables.
>
> On Mon, Jul 19, 2021 at 3:57 PM Steven Wu <st...@gmail.com> wrote:
>
>> Let's assume the Flink checkpoint state is uploaded to S3. Attacker needs
>> to be able to read from and write to S3 to manipulate the S3 files. Is this
>> the scenario we are concerned about?
>>
>> On Mon, Jul 19, 2021 at 3:51 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Thanks, Steven. Do you think that there is a potential problem with an
>>> attacker having access to where the state is stored and using that to
>>> inject code? Is this something we should just update to avoid it entirely?
>>>
>>> On Mon, Jul 19, 2021 at 3:43 PM Steven Wu <st...@gmail.com> wrote:
>>>
>>>> I believe Flink source is the only place that uses Java serialization
>>>> for checkpoint state: https://github.com/apache/iceberg/issues/1698.
>>>>
>>>>  @OpenInx <op...@gmail.com> already updated Flink sink to avoid the
>>>> Java serialization (long time ago)
>>>>
>>>> On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>>> Yes I totally agree that the distributed system itself should make
>>>>> sure the integrity of objects passing across nodes. I am more concerned
>>>>> about the Flink case where some information is persisted and can be
>>>>> modified to execute arbitrary code. Maybe people working on Flink can
>>>>> comment on this a bit more.
>>>>>
>>>>> Also, this is dangerous from the perspective that I do not need to
>>>>> modify any existing serialized object, but as long as I can communicate
>>>>> with the node that tries to run the deserialization logic of incoming
>>>>> payload, it will execute the code of getObject method even if the
>>>>> deserialization itself fails. And I believe this is an issue not only for
>>>>> Java serialization but also for Kryo serialization, basically any
>>>>> serialization mechanisms that try to bypass using the public constructor.
>>>>>
>>>>> I am definitely not an expert on this, I only noticed this while
>>>>> adding some features to Trino, where the community just completely avoids
>>>>> the use of Java serialization for such a reason. I think some actions are
>>>>> needed here:
>>>>> 1. make sure we fully understand the implication of our decision to
>>>>> use Java serialization, some warnings are likely needed for organizations
>>>>> that tries to adopt Iceberg to their own compute
>>>>> 2. introduce some mechanisms to allow only trusted classes to be
>>>>> deserialized, which I saw was done to fix some attacks like
>>>>> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will
>>>>> explore the possibility to at least leverage the
>>>>> ValidatingObjectInputStream.
>>>>> 3. maybe allow some currently protected class constructors to be
>>>>> public, so that engines at least have the option to have their own
>>>>> deserialization mechanism when such an attack cannot be fully mitigated on
>>>>> the engine's side.
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>>> Jack,
>>>>>>
>>>>>> I might be incorrect here, but I'll at least throw out some thoughts.
>>>>>> If I understand correctly, the attacker requires access to modify some
>>>>>> serialized object so that deserialization leads to arbitrary code
>>>>>> execution. I think that the best way to protect against that is to avoid
>>>>>> making it possible for an attacker to modify serialized bytes.
>>>>>>
>>>>>> To my knowledge, Java serialization is used in two places: first, to
>>>>>> serialize objects between nodes, like sending a task to a Spark executor,
>>>>>> and second, to serialize some persistent state in Flink. Iceberg does not
>>>>>> use Java serialization for anything in the format or long-term storage. For
>>>>>> the first case, I think that it is up to the distributed system passing
>>>>>> objects between nodes to secure the content, like using TLS for connections
>>>>>> between nodes. Since Java serialization is used by the processing engine,
>>>>>> there isn't much Iceberg could do to change this and we have to rely on
>>>>>> Spark or Flink.
>>>>>>
>>>>>> For the second issue, I think our use of Java serialization to store
>>>>>> state is very limited, but we should take a look to make sure. I think this
>>>>>> is one area where Iceberg made the choice to use Java serialization, so we
>>>>>> should look into it and fix it if possible... although I'm not entirely
>>>>>> sure how to avoid swapping out the state that gets loaded.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <ye...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> We use Java serialization and deserialization a lot in Iceberg. I
>>>>>>> wonder if we have considered the potential of Java deserialization attack,
>>>>>>> where an attacker can replace serialized bytes to execute arbitrary code
>>>>>>> through the readObject method.
>>>>>>>
>>>>>>> Currently our SerializationUtil.deserializeFromBytes directly
>>>>>>> converts bytes to an ObjectInputStream. I know Apache commons have
>>>>>>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>>>>>>
>>>>>>> Have we thought about this issue in the past? Are there any other
>>>>>>> suggestions?
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>

Re: Java Deserialization Vulnerability

Posted by Ryan Blue <bl...@tabular.io>.
Yes, I think so. Sounds like an unsecured bucket could lead to code
execution running with data infrastructure privileges. While it isn't
exactly Flink's problem, we should probably treat this like a potential
privilege escalation issue. How does Flink handle this for other cases? I
think it would make sense to avoid using Java serialization for checkpoint
state for Iceberg at least, but it is probably a wider problem than just
Iceberg tables.

On Mon, Jul 19, 2021 at 3:57 PM Steven Wu <st...@gmail.com> wrote:

> Let's assume the Flink checkpoint state is uploaded to S3. Attacker needs
> to be able to read from and write to S3 to manipulate the S3 files. Is this
> the scenario we are concerned about?
>
> On Mon, Jul 19, 2021 at 3:51 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> Thanks, Steven. Do you think that there is a potential problem with an
>> attacker having access to where the state is stored and using that to
>> inject code? Is this something we should just update to avoid it entirely?
>>
>> On Mon, Jul 19, 2021 at 3:43 PM Steven Wu <st...@gmail.com> wrote:
>>
>>> I believe Flink source is the only place that uses Java serialization
>>> for checkpoint state: https://github.com/apache/iceberg/issues/1698.
>>>
>>>  @OpenInx <op...@gmail.com> already updated Flink sink to avoid the
>>> Java serialization (long time ago)
>>>
>>> On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <ye...@gmail.com> wrote:
>>>
>>>> Yes I totally agree that the distributed system itself should make sure
>>>> the integrity of objects passing across nodes. I am more concerned about
>>>> the Flink case where some information is persisted and can be modified to
>>>> execute arbitrary code. Maybe people working on Flink can comment on this a
>>>> bit more.
>>>>
>>>> Also, this is dangerous from the perspective that I do not need to
>>>> modify any existing serialized object, but as long as I can communicate
>>>> with the node that tries to run the deserialization logic of incoming
>>>> payload, it will execute the code of getObject method even if the
>>>> deserialization itself fails. And I believe this is an issue not only for
>>>> Java serialization but also for Kryo serialization, basically any
>>>> serialization mechanisms that try to bypass using the public constructor.
>>>>
>>>> I am definitely not an expert on this, I only noticed this while adding
>>>> some features to Trino, where the community just completely avoids the use
>>>> of Java serialization for such a reason. I think some actions are needed
>>>> here:
>>>> 1. make sure we fully understand the implication of our decision to use
>>>> Java serialization, some warnings are likely needed for organizations that
>>>> tries to adopt Iceberg to their own compute
>>>> 2. introduce some mechanisms to allow only trusted classes to be
>>>> deserialized, which I saw was done to fix some attacks like
>>>> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will
>>>> explore the possibility to at least leverage the
>>>> ValidatingObjectInputStream.
>>>> 3. maybe allow some currently protected class constructors to be
>>>> public, so that engines at least have the option to have their own
>>>> deserialization mechanism when such an attack cannot be fully mitigated on
>>>> the engine's side.
>>>>
>>>> -Jack
>>>>
>>>>
>>>> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> Jack,
>>>>>
>>>>> I might be incorrect here, but I'll at least throw out some thoughts.
>>>>> If I understand correctly, the attacker requires access to modify some
>>>>> serialized object so that deserialization leads to arbitrary code
>>>>> execution. I think that the best way to protect against that is to avoid
>>>>> making it possible for an attacker to modify serialized bytes.
>>>>>
>>>>> To my knowledge, Java serialization is used in two places: first, to
>>>>> serialize objects between nodes, like sending a task to a Spark executor,
>>>>> and second, to serialize some persistent state in Flink. Iceberg does not
>>>>> use Java serialization for anything in the format or long-term storage. For
>>>>> the first case, I think that it is up to the distributed system passing
>>>>> objects between nodes to secure the content, like using TLS for connections
>>>>> between nodes. Since Java serialization is used by the processing engine,
>>>>> there isn't much Iceberg could do to change this and we have to rely on
>>>>> Spark or Flink.
>>>>>
>>>>> For the second issue, I think our use of Java serialization to store
>>>>> state is very limited, but we should take a look to make sure. I think this
>>>>> is one area where Iceberg made the choice to use Java serialization, so we
>>>>> should look into it and fix it if possible... although I'm not entirely
>>>>> sure how to avoid swapping out the state that gets loaded.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <ye...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We use Java serialization and deserialization a lot in Iceberg. I
>>>>>> wonder if we have considered the potential of Java deserialization attack,
>>>>>> where an attacker can replace serialized bytes to execute arbitrary code
>>>>>> through the readObject method.
>>>>>>
>>>>>> Currently our SerializationUtil.deserializeFromBytes directly
>>>>>> converts bytes to an ObjectInputStream. I know Apache commons have
>>>>>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>>>>>
>>>>>> Have we thought about this issue in the past? Are there any other
>>>>>> suggestions?
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular

Re: Java Deserialization Vulnerability

Posted by Steven Wu <st...@gmail.com>.
Let's assume the Flink checkpoint state is uploaded to S3. Attacker needs
to be able to read from and write to S3 to manipulate the S3 files. Is this
the scenario we are concerned about?

On Mon, Jul 19, 2021 at 3:51 PM Ryan Blue <bl...@tabular.io> wrote:

> Thanks, Steven. Do you think that there is a potential problem with an
> attacker having access to where the state is stored and using that to
> inject code? Is this something we should just update to avoid it entirely?
>
> On Mon, Jul 19, 2021 at 3:43 PM Steven Wu <st...@gmail.com> wrote:
>
>> I believe Flink source is the only place that uses Java serialization for
>> checkpoint state: https://github.com/apache/iceberg/issues/1698.
>>
>>  @OpenInx <op...@gmail.com> already updated Flink sink to avoid the
>> Java serialization (long time ago)
>>
>> On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <ye...@gmail.com> wrote:
>>
>>> Yes I totally agree that the distributed system itself should make sure
>>> the integrity of objects passing across nodes. I am more concerned about
>>> the Flink case where some information is persisted and can be modified to
>>> execute arbitrary code. Maybe people working on Flink can comment on this a
>>> bit more.
>>>
>>> Also, this is dangerous from the perspective that I do not need to
>>> modify any existing serialized object, but as long as I can communicate
>>> with the node that tries to run the deserialization logic of incoming
>>> payload, it will execute the code of getObject method even if the
>>> deserialization itself fails. And I believe this is an issue not only for
>>> Java serialization but also for Kryo serialization, basically any
>>> serialization mechanisms that try to bypass using the public constructor.
>>>
>>> I am definitely not an expert on this, I only noticed this while adding
>>> some features to Trino, where the community just completely avoids the use
>>> of Java serialization for such a reason. I think some actions are needed
>>> here:
>>> 1. make sure we fully understand the implication of our decision to use
>>> Java serialization, some warnings are likely needed for organizations that
>>> tries to adopt Iceberg to their own compute
>>> 2. introduce some mechanisms to allow only trusted classes to be
>>> deserialized, which I saw was done to fix some attacks like
>>> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will explore
>>> the possibility to at least leverage the ValidatingObjectInputStream.
>>> 3. maybe allow some currently protected class constructors to be public,
>>> so that engines at least have the option to have their own deserialization
>>> mechanism when such an attack cannot be fully mitigated on the engine's
>>> side.
>>>
>>> -Jack
>>>
>>>
>>> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> Jack,
>>>>
>>>> I might be incorrect here, but I'll at least throw out some thoughts.
>>>> If I understand correctly, the attacker requires access to modify some
>>>> serialized object so that deserialization leads to arbitrary code
>>>> execution. I think that the best way to protect against that is to avoid
>>>> making it possible for an attacker to modify serialized bytes.
>>>>
>>>> To my knowledge, Java serialization is used in two places: first, to
>>>> serialize objects between nodes, like sending a task to a Spark executor,
>>>> and second, to serialize some persistent state in Flink. Iceberg does not
>>>> use Java serialization for anything in the format or long-term storage. For
>>>> the first case, I think that it is up to the distributed system passing
>>>> objects between nodes to secure the content, like using TLS for connections
>>>> between nodes. Since Java serialization is used by the processing engine,
>>>> there isn't much Iceberg could do to change this and we have to rely on
>>>> Spark or Flink.
>>>>
>>>> For the second issue, I think our use of Java serialization to store
>>>> state is very limited, but we should take a look to make sure. I think this
>>>> is one area where Iceberg made the choice to use Java serialization, so we
>>>> should look into it and fix it if possible... although I'm not entirely
>>>> sure how to avoid swapping out the state that gets loaded.
>>>>
>>>> Ryan
>>>>
>>>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We use Java serialization and deserialization a lot in Iceberg. I
>>>>> wonder if we have considered the potential of Java deserialization attack,
>>>>> where an attacker can replace serialized bytes to execute arbitrary code
>>>>> through the readObject method.
>>>>>
>>>>> Currently our SerializationUtil.deserializeFromBytes directly converts
>>>>> bytes to an ObjectInputStream. I know Apache commons have
>>>>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>>>>
>>>>> Have we thought about this issue in the past? Are there any other
>>>>> suggestions?
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>
> --
> Ryan Blue
> Tabular
>

Re: Java Deserialization Vulnerability

Posted by Ryan Blue <bl...@tabular.io>.
Thanks, Steven. Do you think that there is a potential problem with an
attacker having access to where the state is stored and using that to
inject code? Is this something we should just update to avoid it entirely?

On Mon, Jul 19, 2021 at 3:43 PM Steven Wu <st...@gmail.com> wrote:

> I believe Flink source is the only place that uses Java serialization for
> checkpoint state: https://github.com/apache/iceberg/issues/1698.
>
>  @OpenInx <op...@gmail.com> already updated Flink sink to avoid the
> Java serialization (long time ago)
>
> On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <ye...@gmail.com> wrote:
>
>> Yes I totally agree that the distributed system itself should make sure
>> the integrity of objects passing across nodes. I am more concerned about
>> the Flink case where some information is persisted and can be modified to
>> execute arbitrary code. Maybe people working on Flink can comment on this a
>> bit more.
>>
>> Also, this is dangerous from the perspective that I do not need to modify
>> any existing serialized object, but as long as I can communicate with the
>> node that tries to run the deserialization logic of incoming payload, it
>> will execute the code of getObject method even if the deserialization
>> itself fails. And I believe this is an issue not only for Java
>> serialization but also for Kryo serialization, basically any serialization
>> mechanisms that try to bypass using the public constructor.
>>
>> I am definitely not an expert on this, I only noticed this while adding
>> some features to Trino, where the community just completely avoids the use
>> of Java serialization for such a reason. I think some actions are needed
>> here:
>> 1. make sure we fully understand the implication of our decision to use
>> Java serialization, some warnings are likely needed for organizations that
>> tries to adopt Iceberg to their own compute
>> 2. introduce some mechanisms to allow only trusted classes to be
>> deserialized, which I saw was done to fix some attacks like
>> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will explore
>> the possibility to at least leverage the ValidatingObjectInputStream.
>> 3. maybe allow some currently protected class constructors to be public,
>> so that engines at least have the option to have their own deserialization
>> mechanism when such an attack cannot be fully mitigated on the engine's
>> side.
>>
>> -Jack
>>
>>
>> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Jack,
>>>
>>> I might be incorrect here, but I'll at least throw out some thoughts. If
>>> I understand correctly, the attacker requires access to modify some
>>> serialized object so that deserialization leads to arbitrary code
>>> execution. I think that the best way to protect against that is to avoid
>>> making it possible for an attacker to modify serialized bytes.
>>>
>>> To my knowledge, Java serialization is used in two places: first, to
>>> serialize objects between nodes, like sending a task to a Spark executor,
>>> and second, to serialize some persistent state in Flink. Iceberg does not
>>> use Java serialization for anything in the format or long-term storage. For
>>> the first case, I think that it is up to the distributed system passing
>>> objects between nodes to secure the content, like using TLS for connections
>>> between nodes. Since Java serialization is used by the processing engine,
>>> there isn't much Iceberg could do to change this and we have to rely on
>>> Spark or Flink.
>>>
>>> For the second issue, I think our use of Java serialization to store
>>> state is very limited, but we should take a look to make sure. I think this
>>> is one area where Iceberg made the choice to use Java serialization, so we
>>> should look into it and fix it if possible... although I'm not entirely
>>> sure how to avoid swapping out the state that gets loaded.
>>>
>>> Ryan
>>>
>>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <ye...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> We use Java serialization and deserialization a lot in Iceberg. I
>>>> wonder if we have considered the potential of Java deserialization attack,
>>>> where an attacker can replace serialized bytes to execute arbitrary code
>>>> through the readObject method.
>>>>
>>>> Currently our SerializationUtil.deserializeFromBytes directly converts
>>>> bytes to an ObjectInputStream. I know Apache commons have
>>>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>>>
>>>> Have we thought about this issue in the past? Are there any other
>>>> suggestions?
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

-- 
Ryan Blue
Tabular

Re: Java Deserialization Vulnerability

Posted by Steven Wu <st...@gmail.com>.
I believe Flink source is the only place that uses Java serialization for
checkpoint state: https://github.com/apache/iceberg/issues/1698.

 @OpenInx <op...@gmail.com> already updated Flink sink to avoid the Java
serialization (long time ago)

On Mon, Jul 19, 2021 at 1:53 PM Jack Ye <ye...@gmail.com> wrote:

> Yes I totally agree that the distributed system itself should make sure
> the integrity of objects passing across nodes. I am more concerned about
> the Flink case where some information is persisted and can be modified to
> execute arbitrary code. Maybe people working on Flink can comment on this a
> bit more.
>
> Also, this is dangerous from the perspective that I do not need to modify
> any existing serialized object, but as long as I can communicate with the
> node that tries to run the deserialization logic of incoming payload, it
> will execute the code of getObject method even if the deserialization
> itself fails. And I believe this is an issue not only for Java
> serialization but also for Kryo serialization, basically any serialization
> mechanisms that try to bypass using the public constructor.
>
> I am definitely not an expert on this, I only noticed this while adding
> some features to Trino, where the community just completely avoids the use
> of Java serialization for such a reason. I think some actions are needed
> here:
> 1. make sure we fully understand the implication of our decision to use
> Java serialization, some warnings are likely needed for organizations that
> tries to adopt Iceberg to their own compute
> 2. introduce some mechanisms to allow only trusted classes to be
> deserialized, which I saw was done to fix some attacks like
> https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will explore
> the possibility to at least leverage the ValidatingObjectInputStream.
> 3. maybe allow some currently protected class constructors to be public,
> so that engines at least have the option to have their own deserialization
> mechanism when such an attack cannot be fully mitigated on the engine's
> side.
>
> -Jack
>
>
> On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <bl...@tabular.io> wrote:
>
>> Jack,
>>
>> I might be incorrect here, but I'll at least throw out some thoughts. If
>> I understand correctly, the attacker requires access to modify some
>> serialized object so that deserialization leads to arbitrary code
>> execution. I think that the best way to protect against that is to avoid
>> making it possible for an attacker to modify serialized bytes.
>>
>> To my knowledge, Java serialization is used in two places: first, to
>> serialize objects between nodes, like sending a task to a Spark executor,
>> and second, to serialize some persistent state in Flink. Iceberg does not
>> use Java serialization for anything in the format or long-term storage. For
>> the first case, I think that it is up to the distributed system passing
>> objects between nodes to secure the content, like using TLS for connections
>> between nodes. Since Java serialization is used by the processing engine,
>> there isn't much Iceberg could do to change this and we have to rely on
>> Spark or Flink.
>>
>> For the second issue, I think our use of Java serialization to store
>> state is very limited, but we should take a look to make sure. I think this
>> is one area where Iceberg made the choice to use Java serialization, so we
>> should look into it and fix it if possible... although I'm not entirely
>> sure how to avoid swapping out the state that gets loaded.
>>
>> Ryan
>>
>> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <ye...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> We use Java serialization and deserialization a lot in Iceberg. I wonder
>>> if we have considered the potential of Java deserialization attack, where
>>> an attacker can replace serialized bytes to execute arbitrary code through
>>> the readObject method.
>>>
>>> Currently our SerializationUtil.deserializeFromBytes directly converts
>>> bytes to an ObjectInputStream. I know Apache commons have
>>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>>
>>> Have we thought about this issue in the past? Are there any other
>>> suggestions?
>>>
>>> Best,
>>> Jack Ye
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Re: Java Deserialization Vulnerability

Posted by Jack Ye <ye...@gmail.com>.
Yes I totally agree that the distributed system itself should make sure the
integrity of objects passing across nodes. I am more concerned about the
Flink case where some information is persisted and can be modified to
execute arbitrary code. Maybe people working on Flink can comment on this a
bit more.

Also, this is dangerous from the perspective that I do not need to modify
any existing serialized object, but as long as I can communicate with the
node that tries to run the deserialization logic of incoming payload, it
will execute the code of getObject method even if the deserialization
itself fails. And I believe this is an issue not only for Java
serialization but also for Kryo serialization, basically any serialization
mechanisms that try to bypass using the public constructor.

I am definitely not an expert on this, I only noticed this while adding
some features to Trino, where the community just completely avoids the use
of Java serialization for such a reason. I think some actions are needed
here:
1. make sure we fully understand the implication of our decision to use
Java serialization, some warnings are likely needed for organizations that
tries to adopt Iceberg to their own compute
2. introduce some mechanisms to allow only trusted classes to be
deserialized, which I saw was done to fix some attacks like
https://nvd.nist.gov/vuln/detail/CVE-2020-5413 for Kryo. I will explore the
possibility to at least leverage the ValidatingObjectInputStream.
3. maybe allow some currently protected class constructors to be public, so
that engines at least have the option to have their own deserialization
mechanism when such an attack cannot be fully mitigated on the engine's
side.

-Jack


On Mon, Jul 19, 2021 at 10:56 AM Ryan Blue <bl...@tabular.io> wrote:

> Jack,
>
> I might be incorrect here, but I'll at least throw out some thoughts. If I
> understand correctly, the attacker requires access to modify some
> serialized object so that deserialization leads to arbitrary code
> execution. I think that the best way to protect against that is to avoid
> making it possible for an attacker to modify serialized bytes.
>
> To my knowledge, Java serialization is used in two places: first, to
> serialize objects between nodes, like sending a task to a Spark executor,
> and second, to serialize some persistent state in Flink. Iceberg does not
> use Java serialization for anything in the format or long-term storage. For
> the first case, I think that it is up to the distributed system passing
> objects between nodes to secure the content, like using TLS for connections
> between nodes. Since Java serialization is used by the processing engine,
> there isn't much Iceberg could do to change this and we have to rely on
> Spark or Flink.
>
> For the second issue, I think our use of Java serialization to store state
> is very limited, but we should take a look to make sure. I think this is
> one area where Iceberg made the choice to use Java serialization, so we
> should look into it and fix it if possible... although I'm not entirely
> sure how to avoid swapping out the state that gets loaded.
>
> Ryan
>
> On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <ye...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> We use Java serialization and deserialization a lot in Iceberg. I wonder
>> if we have considered the potential of Java deserialization attack, where
>> an attacker can replace serialized bytes to execute arbitrary code through
>> the readObject method.
>>
>> Currently our SerializationUtil.deserializeFromBytes directly converts
>> bytes to an ObjectInputStream. I know Apache commons have
>> ValidatingObjectInputStream which can prevent the issue to some extent.
>>
>> Have we thought about this issue in the past? Are there any other
>> suggestions?
>>
>> Best,
>> Jack Ye
>>
>
>
> --
> Ryan Blue
> Tabular
>

Re: Java Deserialization Vulnerability

Posted by Ryan Blue <bl...@tabular.io>.
Jack,

I might be incorrect here, but I'll at least throw out some thoughts. If I
understand correctly, the attacker requires access to modify some
serialized object so that deserialization leads to arbitrary code
execution. I think that the best way to protect against that is to avoid
making it possible for an attacker to modify serialized bytes.

To my knowledge, Java serialization is used in two places: first, to
serialize objects between nodes, like sending a task to a Spark executor,
and second, to serialize some persistent state in Flink. Iceberg does not
use Java serialization for anything in the format or long-term storage. For
the first case, I think that it is up to the distributed system passing
objects between nodes to secure the content, like using TLS for connections
between nodes. Since Java serialization is used by the processing engine,
there isn't much Iceberg could do to change this and we have to rely on
Spark or Flink.

For the second issue, I think our use of Java serialization to store state
is very limited, but we should take a look to make sure. I think this is
one area where Iceberg made the choice to use Java serialization, so we
should look into it and fix it if possible... although I'm not entirely
sure how to avoid swapping out the state that gets loaded.

Ryan

On Sat, Jul 17, 2021 at 2:02 AM Jack Ye <ye...@gmail.com> wrote:

> Hi everyone,
>
> We use Java serialization and deserialization a lot in Iceberg. I wonder
> if we have considered the potential of Java deserialization attack, where
> an attacker can replace serialized bytes to execute arbitrary code through
> the readObject method.
>
> Currently our SerializationUtil.deserializeFromBytes directly converts
> bytes to an ObjectInputStream. I know Apache commons have
> ValidatingObjectInputStream which can prevent the issue to some extent.
>
> Have we thought about this issue in the past? Are there any other
> suggestions?
>
> Best,
> Jack Ye
>


-- 
Ryan Blue
Tabular