Posted to dev@ozone.apache.org by Sumit Agrawal <su...@cloudera.com.INVALID> on 2022/11/29 06:20:05 UTC

OM to DN token verification should include Pipeline

Hi Devs,


   1. Related to HDDS-7454 <https://issues.apache.org/jira/browse/HDDS-7454>:
   I need opinions on whether this requires handling or not, based on its
   impact and complexity. A brief summary is given below; the same is also
   described in the Jira.


Please share your opinions ...

*For non-secure env* with a raw/malicious client, the cases are:

1) Writing to a new DN causes an unexpected container to be created, which
can cause data loss - raised JIRA: HDDS-7552
<https://issues.apache.org/jira/browse/HDDS-7552>

    Fix: avoid the write / delete the container on the DN.

2) Writing a new block to a DN that already has the container creates
additional blocks and consumes space

    Impact: additional space consumption

    Note: there is no way to control this in the current design, as OM and
DN do not have any sync; a solution may be needed in the future, possibly
involving Recon, which can hold the OM, SCM and DN information and their
mapping.

3) Writing an unknown container to a DN, causing a container to be created
- already handled by HDDS-3241
<https://issues.apache.org/jira/browse/HDDS-3241>



*For a secure env*, per the current bug, I need opinions on whether the
following should be handled, based on its impact:

   1. Authorization of the pipeline / DNs: this is not present currently.
   The suggestion in this bug is to add it as part of the block token.



Pros:

   - Avoids writes to DNs for which the data is not intended, and with that
   the malicious impact of data loss and space consumption shown above for
   the non-secure env.

Cons:

   - Needs code for adding the pipeline during token generation, and for
   passing and validating it at the DNs (see the rough sketch after this
   list).
   - The code will be complex; EC syncs in a different way, which adds
   complexity and failure points.
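
To make the first con concrete, here is a rough, self-contained sketch of
what carrying the pipeline members in the block token and verifying them on
the DN could look like. This is not the actual Ozone token code; the class
and field names are hypothetical, and the real signature check against the
OM/SCM-issued secret key is omitted.

    import java.util.Set;

    // Hypothetical, simplified stand-in for a block token that also carries
    // the datanode UUIDs of the pipeline it was issued for.
    record BlockToken(String owner, long containerId, long localId,
                      Set<String> pipelineDatanodes) { }

    final class DatanodeTokenCheck {

        // On the DN, besides the existing container/block scope checks, the DN
        // would additionally verify that its own UUID is listed in the token.
        static boolean verify(BlockToken token, String thisDatanodeUuid,
                              long requestedContainerId, long requestedLocalId) {
            return token.containerId() == requestedContainerId
                && token.localId() == requestedLocalId
                && token.pipelineDatanodes().contains(thisDatanodeUuid);
        }

        public static void main(String[] args) {
            BlockToken token = new BlockToken("client-1", 42L, 7L,
                Set.of("dn-a", "dn-b", "dn-c"));
            // Write sent to a pipeline member: accepted.
            System.out.println(verify(token, "dn-a", 42L, 7L));  // true
            // Write redirected to a DN outside the pipeline: rejected.
            System.out.println(verify(token, "dn-x", 42L, 7L));  // false
        }
    }

The check itself is cheap; the cost is plumbing the pipeline into token
generation and keeping it correct for EC, as listed above.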

*Security impact if this is not handled:*

   - SCM needs to validate the new container via the ICR, which is async,
   and it takes at least 2 heartbeats to tell the DN to stop the writes
   (30+ seconds); see the rough timing sketch after this list.
   - During this window, the client can add a lot of block data.
   - Exploitation is easy, but the client must already be authorized with
   block write permission.
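
For the timing above, a minimal back-of-the-envelope sketch, assuming a
default DN heartbeat interval of 30 seconds (the actual interval is
configurable):

    import java.time.Duration;

    // Rough estimate of how long a rogue client can keep writing before the
    // async ICR path gets the container closed/deleted on the DN.
    public class ExposureWindow {
        public static void main(String[] args) {
            Duration heartbeat = Duration.ofSeconds(30); // assumed default interval
            int heartbeatsNeeded = 2;                    // ICR out + command back
            Duration window = heartbeat.multipliedBy(heartbeatsNeeded);
            System.out.println("Worst-case rogue write window ~ "
                + window.toSeconds() + "s");             // ~60s
        }
    }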



-- 
*Sumit Agrawal* | Senior Staff Engineer
cloudera.com <https://www.cloudera.com>
------------------------------

Re: OM to DN token verification should include Pipeline

Posted by Sumit Agrawal <su...@cloudera.com.INVALID>.
Thanks Pifta,

1. The solution where SCM validates containers reported by DNs in the ICR
will be added; that resolves both the secure and the non-secure
environment.

2. *Agree* that for a secure env, pipeline validation *will not add much
value* (with the above point handled) and the *impact will be very low*,
because:
- primary write access is already validated using the block token, which
carries the container and block info
- it is very unlikely that a client with valid access will maliciously
write to a different Datanode, and that impact is contained within the
2-heartbeat window.

Considering this, I think we do not need to add extra pipeline
authorization, as the impact is very low.

Regards
Sumit.


On Wed, Dec 7, 2022 at 4:52 AM István Fajth <fa...@gmail.com> wrote:



-- 
*Sumit Agrawal* | Senior Staff Engineer
cloudera.com <https://www.cloudera.com>
------------------------------

Re: OM to DN token verification should include Pipeline

Posted by István Fajth <fa...@gmail.com>.
Hi Sumit,

sorry for getting back somewhat late on this; let me share my opinion here,
as I will also do in the JIRA ticket shortly.

As we discussed, the problem is that currently a rogue client can write
blocks to DataNodes that differ from the Pipeline information the Ozone
Manager provided to the client. This is true in secure and non-secure
environments.
As Neil mentioned, this might compromise a container when SCM checks the
replicas, figures out which containers are over-replicated, and decides
which excess replicas to delete: if a rogue client writes a container to 3
nodes (even via the STANDALONE replication type) and properly syncs these
writes, the bcsid associated with the rogue replicas might go above the
one in the good replicas, take over precedence, and potentially cause the
old valid data to be removed.
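
To make that risk concrete, here is a simplified, self-contained
illustration. This is not the actual ReplicationManager logic; the names
and the keep-the-highest-bcsid rule are assumptions used only to show how
rogue replicas could displace the valid ones:

    import java.util.List;

    // Simplified illustration only: excess-replica handling that prefers the
    // highest bcsid would keep the rogue copies and delete the valid ones.
    public class BcsidPrecedence {
        record Replica(String datanode, long bcsid, boolean rogue) { }

        public static void main(String[] args) {
            List<Replica> replicas = List.of(
                new Replica("dn-a", 100, false),
                new Replica("dn-b", 100, false),
                new Replica("dn-c", 100, false),
                new Replica("dn-x", 250, true),   // rogue, synced to a higher bcsid
                new Replica("dn-y", 250, true),
                new Replica("dn-z", 250, true));

            long highestBcsid = replicas.stream()
                .mapToLong(Replica::bcsid).max().orElse(0);

            // Six replicas of a factor-3 container: three must go. A rule that
            // keeps the highest-bcsid copies would delete the valid data.
            replicas.stream()
                .filter(r -> r.bcsid() < highestBcsid)
                .forEach(r -> System.out.println("delete candidate: " + r.datanode()));
        }
    }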

As this can happen in a non-secure environment, I strongly believe we
should not touch the tokens: that does not solve the problem at all, since
tokens are present only in a secured environment.

I think the solution is within SCM: if a DN does not have the container
yet (it does not have a valid replica of the container), then container
creation triggers an ICR, and while that ICR is processed, the container
should be marked as an invalid replica and SCM should issue a delete
container command to the DataNode that reported it. (We should be able to
determine that the container is invalid during ICR processing, as SCM
knows which container belongs to which Pipeline, and if the DN is not part
of the Pipeline it should not be reporting the creation of a container
with that specific container ID.)
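
A minimal sketch of this ICR-time check, using invented class and method
names rather than SCM's actual report handler API:

    import java.util.Map;
    import java.util.Set;

    // Hypothetical, simplified version of the proposed check while SCM
    // processes an incremental container report (ICR).
    public class IcrPipelineCheck {

        // containerId -> datanode UUIDs of the pipeline the container was created on
        private final Map<Long, Set<String>> containerToPipeline;

        IcrPipelineCheck(Map<Long, Set<String>> containerToPipeline) {
            this.containerToPipeline = containerToPipeline;
        }

        // Returns true if the reported replica is valid; false means SCM should
        // mark the replica invalid and queue a delete-container command for the DN.
        boolean onContainerReported(long containerId, String reportingDnUuid) {
            Set<String> pipelineMembers = containerToPipeline.get(containerId);
            if (pipelineMembers == null) {
                // Unknown container id: the DN-side check from HDDS-3241 already
                // rejects these, so treat the replica as invalid here as well.
                return false;
            }
            return pipelineMembers.contains(reportingDnUuid);
        }

        public static void main(String[] args) {
            IcrPipelineCheck check = new IcrPipelineCheck(
                Map.of(42L, Set.of("dn-a", "dn-b", "dn-c")));
            System.out.println(check.onContainerReported(42L, "dn-a")); // true
            System.out.println(check.onContainerReported(42L, "dn-x")); // false -> delete
        }
    }
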
If possible, Ozone Manager should also refuse the write and metadata
update, based on information provided by SCM (either by caching the
in-flight write Pipelines and comparing them with the Pipelines reported
by the client at the end of the write, or by directly checking the write
location with SCM to validate the write).
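
Correspondingly, a rough sketch of the OM-side idea, again with invented
names and under the assumption that OM keeps (or can fetch from SCM) the
pipeline it allocated for each block; at commit time, any block whose
reported locations fall outside that pipeline would be rejected:

    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch: validate the client-reported write locations at key
    // commit against the pipeline OM originally allocated for the block.
    public class CommitLocationCheck {

        // blockId -> datanode UUIDs of the pipeline allocated for that block
        private final Map<Long, Set<String>> allocatedPipelines;

        CommitLocationCheck(Map<Long, Set<String>> allocatedPipelines) {
            this.allocatedPipelines = allocatedPipelines;
        }

        // Reject the commit if the client claims the block landed on any DN
        // outside the allocated pipeline.
        boolean validateCommit(long blockId, Set<String> reportedLocations) {
            Set<String> allocated = allocatedPipelines.get(blockId);
            return allocated != null && allocated.containsAll(reportedLocations);
        }

        public static void main(String[] args) {
            CommitLocationCheck check = new CommitLocationCheck(
                Map.of(7L, Set.of("dn-a", "dn-b", "dn-c")));
            System.out.println(check.validateCommit(7L, Set.of("dn-a", "dn-b"))); // true
            System.out.println(check.validateCommit(7L, Set.of("dn-x")));         // false
        }
    }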

I believe we should not include this information in the tokens, as we do
not gain anything from it once proper measures are in place to deal with
such rogue clients. Here is why: if SCM instructs the DN within 2
heartbeats to remove the rogue container, then rogue clients have 2 HB of
time (1 min by default if no container creation happens in between the 2
HB, but it does happen... so less than 1 min) to occupy cluster space with
garbage data. But to do that they need access permission in the first
place, and if they have access permissions, they can write garbage to
valid locations anyway. So the only thing we need to prevent is messing up
the container space and the OM metadata, and that is done with the
proposed check in ICR and with the check at committing the write from the
client to OM.

Regards,
Pifta

Sumit Agrawal <su...@cloudera.com.invalid> wrote (on 2022. nov. 29., Tue,
7:20):



-- 
Pifta