Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/01/25 00:15:30 UTC

[GitHub] [ozone] fapifta commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

fapifta commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791254483



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+A basic expectation of distributed systems is to provide data durability.
+To achieve high data durability, many popular storage systems use a replication
+approach, which is expensive. Apache Ozone supports the `RATIS/THREE` replication scheme.
+Ozone's default replication scheme, `RATIS/THREE`, has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activity, the additional
+block replicas are rarely accessed during normal operations, but they still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of a replication factor,
+we introduced the ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing data
+durability similar to traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks consumes 6 * 3 = `18` blocks of disk space,
+but with an EC (6 data, 3 parity) deployment it consumes only 6 + 3 = `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical considerations, the most fitting data layout is the striping model.
+The data striping layout is not new. The striping model has already been adopted successfully
+by several other file systems (e.g., the Quantcast File System and the Hadoop Distributed File System).
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks are distributed to the first 6 data nodes in order,
+and then the client generates the 3 parity chunks and transfers them to the remaining 3 nodes in order.
+These 9 chunks together are called a "Stripe". The next 6 chunks are distributed to the same first 6 data nodes again,
+and the parity to the remaining 3 nodes. The blocks stored together on these 9 data nodes are called a "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.

Review comment:
       nit: beyond teh size -> beyond the size
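
For illustration, the storage-overhead and stripe arithmetic described in the quoted section can be sketched in a few lines of Java (a minimal, hypothetical example; none of these names are Ozone APIs):

    // Hypothetical sketch: the replication-vs-EC storage math and the
    // per-stripe chunk placement described in the quoted doc.
    public class EcLayoutMath {
        public static void main(String[] args) {
            int d = 6;            // data chunks per stripe
            int p = 3;            // parity chunks per stripe
            long fileBlocks = 6;  // logical blocks in the example file

            // 3x replication stores every block three times: 6 * 3 = 18.
            long replicatedBlocks = fileBlocks * 3;
            // EC(6, 3) stores the 6 data blocks plus 3 parity blocks: 9.
            long ecBlocks = fileBlocks + p;
            System.out.println("RATIS/THREE: " + replicatedBlocks + " blocks");
            System.out.println("EC(6, 3):    " + ecBlocks + " blocks");

            // Within a block group, stripe slot i always lands on node i:
            // slots 0..d-1 carry data, slots d..d+p-1 carry parity, and the
            // next stripe reuses the same d + p nodes.
            for (int stripe = 0; stripe < 2; stripe++) {
                for (int slot = 0; slot < d + p; slot++) {
                    String role = slot < d ? "data" : "parity";
                    System.out.println("stripe " + stripe + ", slot " + slot
                            + " -> node " + slot + " (" + role + ")");
                }
            }
        }
    }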

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes is placed in the Ozone client.
+When the client creates a file, the Ozone Manager allocates a block group (`d + p`
+nodes) from the pipeline provider and returns it to the client.
+As data comes in from the application, the client writes the first d chunks
+to the d data nodes in the block group. It also caches these d chunks
+to generate the parity chunks. Once the parity chunks are generated, it transfers
+them to the remaining p nodes in order. Once all blocks reach their configured sizes,
+the client requests a new block group of nodes.
+
+The diagram below depicts the block allocation in containers as logical groups.
+In the interest of space, we assume an EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom in on the data layout of blockID: 1 from the above picture, shown in the following picture.
+This picture shows how the chunks are laid out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.

Review comment:
       nit: shouldn't this sentence start with something like one of these?
   - The XceiverGrpc client is used for...
   - The XceiverClientGrpc client implementation is used for...
   - The gRPC Datanode client is used for...
   
   Also, the next sentence is a bit hard for me to understand, but that might be just my English, so please take another look. Besides this, it also contains a typo: "datanode sides changes" -> "datanode side changes"
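
The write path discussed here (buffer the d chunks, derive the p parity chunks, send them to the remaining nodes, repeat per stripe) roughly follows this shape; a hedged Java sketch, assuming hypothetical helper names rather than the real Ozone client classes:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative model of the client-side EC write loop (hypothetical
    // names, not the actual Ozone client code).
    public class EcWriteSketch {
        static final int D = 6, P = 3;             // EC(6, 3)
        static final int CHUNK_SIZE = 1024 * 1024; // assumed chunk size

        public static void main(String[] args) {
            List<byte[]> stripeCache = new ArrayList<>();
            // As application data arrives, the client fills one chunk at a
            // time and streams it to the d data nodes of the block group.
            for (int i = 0; i < D; i++) {
                byte[] chunk = new byte[CHUNK_SIZE];  // stand-in for user data
                writeToNode(i, chunk);                // nodes 0..D-1 get data
                stripeCache.add(chunk);               // cached for parity
            }
            // With d chunks cached, the client derives p parity chunks
            // (Reed-Solomon in practice; elided here) and sends them to the
            // remaining p nodes, completing one stripe.
            for (int j = 0; j < P; j++) {
                writeToNode(D + j, encodeParity(stripeCache, j));
            }
            stripeCache.clear();                      // ready for the next stripe
        }

        static void writeToNode(int node, byte[] chunk) {
            System.out.println(chunk.length + " bytes -> node " + node);
        }

        static byte[] encodeParity(List<byte[]> data, int parityIndex) {
            // Placeholder only: a real implementation uses an RS codec.
            return new byte[data.get(0).length];
        }
    }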

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-EC mode. In a single block group, the container ID numbers
+are the same on all nodes. A file can have multiple block groups. Each block group has
+`d + p` blocks, all with the same ID.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out

Review comment:
       nits:
   I would write something like this: "If the key is erasure coded, Ozone client reads it in EC fashion."
   "lay out" -> layout
   () -> (see the previous section about the write path)
   and do the reads. -> and do the reads accordingly.
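
Since the read path has to account for the striped layout, the logical-offset-to-location math matters; here is a small, hypothetical Java sketch of that mapping (the chunk size and names are assumptions, not Ozone constants):

    // Hypothetical offset math for striped EC reads (assumed chunk size;
    // not actual Ozone client code).
    public class EcReadOffsets {
        public static void main(String[] args) {
            int d = 3;                      // EC(3, 2), as in the diagrams
            long chunkSize = 1024L * 1024;  // assumed chunk size

            long logicalOffset = 5L * chunkSize + 12345;   // arbitrary read position

            long stripe = logicalOffset / (d * chunkSize); // which stripe
            long inStripe = logicalOffset % (d * chunkSize);
            int node = (int) (inStripe / chunkSize);       // which data node
            long offsetInBlock = stripe * chunkSize + (inStripe % chunkSize);

            // -> stripe=1 node=2 offsetInBlock=1060921 (1 MB + 12345)
            System.out.println("stripe=" + stripe + " node=" + node
                    + " offsetInBlock=" + offsetInBlock);
        }
    }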

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-ec mode. In a single block group, container id numbers
+are same in all nodes. A file can have multiple block groups. Each block group will
+have `d+p` number of block and all ids are same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out
+is different(Previous section discussed the layout), reads should consider the layout and do the reads. 
+
+EC client will open the connections to DNs based on the expected locations. When all locations are available,

Review comment:
       nit: when all locations are -> when all data locations are
   
   If I am right, the reads do not try to connect to the parity blocks' locations unless an online recovery is required; if this is not true, then I am wrong here ;)
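
If the reviewer's assumption holds (parity locations are contacted only when recovery is needed), the client's decision could be sketched like this (purely illustrative Java; the real client logic may differ):

    // Purely illustrative sketch of the degraded-read decision; not the
    // actual Ozone client logic.
    public class EcDegradedRead {
        public static void main(String[] args) {
            // Availability of the d = 6 data locations returned by OM.
            boolean[] dataAvailable = {true, true, false, true, true, true};
            int p = 3;

            int missing = 0;
            for (boolean ok : dataAvailable) {
                if (!ok) missing++;
            }

            if (missing == 0) {
                System.out.println("Plain read: connect only to the data locations.");
            } else if (missing <= p) {
                System.out.println("Degraded read: also fetch " + missing
                        + " parity chunk(s) and reconstruct the missing data.");
            } else {
                System.out.println("Unreadable: more than p locations are lost.");
            }
        }
    }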




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


