Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/01/21 07:01:29 UTC

[GitHub] [ozone] umamaheswararao opened a new pull request #3006: HDDS-6172: EC: Document the Ozone EC

umamaheswararao opened a new pull request #3006:
URL: https://github.com/apache/ozone/pull/3006


   ## What changes were proposed in this pull request?
   
   Updated the doc for EC
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-6172
   
   ## How was this patch tested?
   
   NA


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] fapifta commented on pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
fapifta commented on pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#issuecomment-1021612667


   Thank you @umamaheswararao. +1 on committing it, once we have also discussed the Hugo vs. simple markup of images with @JyotinderSingh.




[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791393403



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+A basic expectation of distributed systems is to provide data durability.
+To achieve higher data durability, many popular storage systems use a replication
+approach, which is expensive. Apache Ozone supports the `RATIS/THREE` replication
+scheme, which is also its default. This scheme has a 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activity, the additional
+block replicas are rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of a replication factor,
+we introduced the ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability to traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks consumes 6 * 3 = `18` blocks of disk space,
+but with an EC (6 data, 3 parity) deployment, it consumes only `9` blocks of disk space.
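The storage arithmetic above can be sketched with a small helper (illustrative Python, not part of Ozone; the function names are made up):

```python
import math

def replicated_blocks(data_blocks: int, replication_factor: int = 3) -> int:
    """Disk blocks consumed under plain replication."""
    return data_blocks * replication_factor

def ec_blocks(data_blocks: int, d: int = 6, p: int = 3) -> int:
    """Disk blocks consumed under EC(d, p): each group of up to d data
    blocks carries p extra parity blocks."""
    groups = math.ceil(data_blocks / d)
    return data_blocks + groups * p

print(replicated_blocks(6))  # 18 blocks for 3x replication
print(ec_blocks(6))          # 9 blocks for EC (6 data, 3 parity)
```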
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical considerations, the striping model was chosen as the most fitting data layout.
+Data striping is not new: the striping model has already been adopted successfully by several other
+file systems (e.g., the Quantcast File System and the Hadoop Distributed File System).
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks are distributed to the first 6 data nodes in order,
+and then the client generates the 3 parity chunks and transfers them to the remaining 3 nodes in order.
+Together, these 9 chunks are called a "Stripe". The next 6 chunks are distributed to the same first 6 data nodes again,
+and the parity to the remaining 3 nodes. The blocks stored together across these 9 data nodes are called a "BlockGroup".
+
+If the application continues to write beyond the size of `6 * BLOCK_SIZE`, the client requests a new block group from the Ozone Manager.
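The chunk placement described above can be sketched as a simple index calculation (illustrative only; the real placement goes through Ozone's pipeline machinery):

```python
def chunk_location(chunk_no: int, d: int = 6, p: int = 3):
    """Map the n-th data chunk of a block group to (stripe, node):
    data chunks round-robin over the first d nodes in order; each
    stripe then adds p parity chunks on nodes d .. d+p-1."""
    stripe = chunk_no // d
    node = chunk_no % d
    return stripe, node

assert chunk_location(0) == (0, 0)  # first chunk goes to node 0
assert chunk_location(6) == (1, 0)  # 7th chunk starts stripe 1, node 0 again
```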
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes is placed in the Ozone client.
+When the client creates a file, the Ozone Manager allocates a block group (`d + p`
+nodes) from the pipeline provider and returns it to the client.
+As data comes in from the application, the client writes the first d chunks
+to the d data nodes in the block group. It also caches these d chunks
+to generate the parity chunks. Once the parity chunks are generated, it transfers
+them to the remaining p nodes in order. Once all blocks reach their configured sizes,
+the client requests a new block group of nodes.
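A minimal sketch of that write loop, with a single XOR parity standing in for the real coder (Ozone's actual codec produces p independent parities, e.g. via Reed-Solomon; the node callables here are hypothetical):

```python
from functools import reduce

CHUNK = 4     # bytes per chunk (tiny, for illustration)
D, P = 3, 2   # EC(3, 2), as in the diagram

def xor_parity(chunks):
    """Single XOR parity over equal-sized chunks (simplest erasure code;
    real EC generates P distinct parity chunks)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def write_stripe(data: bytes, nodes):
    """Split one stripe into D data chunks, cache them, derive parity,
    and send each chunk to its node in order. `nodes` holds D+P send
    callables, one per datanode in the block group."""
    chunks = [data[i * CHUNK:(i + 1) * CHUNK] for i in range(D)]
    for i, chunk in enumerate(chunks):
        nodes[i](chunk)           # data chunks to the first D nodes
    parity = xor_parity(chunks)   # computed from the cached chunks
    for j in range(P):
        nodes[D + j](parity)      # parity to the remaining P nodes

stored = [bytearray() for _ in range(D + P)]
write_stripe(b"abcdefghijkl", [buf.extend for buf in stored])
```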
+
+The diagram below depicts the block allocation in containers as logical groups.
+In the interest of space, the diagram assumes an EC(3, 2) replication config.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom in on the blockID: 1 data layout from the above picture, shown in the following picture.
+This picture shows how the chunks are laid out in the data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently, the EC client reuses the existing data transfer end-points to transfer data to the data nodes,
+namely the XceiverGRPC client, which is used for writing data and for sending putBlock info.

Review comment:
       Done






[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789566482



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+Currently, the EC client reuses the existing data transfer end-points to transfer data to the data nodes,
+namely the XceiverGRPC client, which is used for writing data and for sending putBlock info.
+Since it uses the existing transfer protocols while transferring the data, the design has the big advantage that
+the datanode-side changes are very minimal. The data blocks at the data nodes are written the
+same as any other block in non-EC mode. In a single block group, the container id numbers
+are the same on all nodes. A file can have multiple block groups. Each block group has
+`d + p` blocks, and all of their ids are the same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, the OM provides the node location details as part of key lookup.
+If the key is EC-replicated, the Ozone client does the reads in EC fashion. Since the data layout
+is different (the previous section discussed the layout), reads must take the layout into account.
+
+The EC client opens connections to the DNs based on the expected locations. When all locations are available,
+it attempts plain reads, chunk by chunk, in round-robin fashion from the d data blocks.
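That round-robin read can be sketched as follows (a hypothetical shape, not the Ozone client API; `blocks` stands for the d data blocks already fetched as byte strings):

```python
def ec_read(blocks, chunk_size):
    """Reassemble the logical data by reading chunk-by-chunk,
    round-robin across the d data blocks, until all are exhausted."""
    out = bytearray()
    offset = 0
    while True:
        progressed = False
        for block in blocks:                  # node 0, 1, ..., d-1 in order
            chunk = block[offset:offset + chunk_size]
            if chunk:
                out.extend(chunk)
                progressed = True
        if not progressed:
            return bytes(out)
        offset += chunk_size

# Logical bytes a..h striped over 3 data blocks, one byte per chunk:
assert ec_read([b"adg", b"beh", b"cf"], 1) == b"abcdefgh"
```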
+
+The picture below shows the read order when there are no failures while reading.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until it sees a read failure, there is no need for EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When the client detects failures while reading, or when starting the reads,
+the Ozone EC client is capable of reconstructing/recovering the lost data by doing EC decoding.
+To do the EC decoding, it needs to read the parity replicas. This is a degraded read, as it needs to do reconstruction.
+The reconstruction is completely transparent to the applications.
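A self-contained sketch of such a degraded read with a single XOR parity (the simplest erasure code, tolerating one lost chunk; Ozone's actual decoding uses a codec such as Reed-Solomon, which tolerates up to p failures):

```python
def xor_bytes(parts):
    """XOR equal-length byte strings together."""
    out = bytearray(len(parts[0]))
    for part in parts:
        for i, byte in enumerate(part):
            out[i] ^= byte
    return bytes(out)

def reconstruct(chunks, parity, lost_index):
    """Recover one lost data chunk: XOR the parity with all surviving
    data chunks. chunks[lost_index] is None (unreadable replica)."""
    survivors = [c for i, c in enumerate(chunks) if i != lost_index]
    return xor_bytes(survivors + [parity])

data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_bytes(data)
# Chunk 1 is unreadable; the degraded read recovers it transparently:
assert reconstruct([b"abcd", None, b"ijkl"], parity, 1) == b"efgh"
```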
+
+The picture below depicts how parity replicas are used in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       The existing approach works fine for larger screens, but it can cause the image to overflow off the screen when viewing the site on smaller screens, since it does not scale the image down automatically. You can reproduce the behavior by changing the viewport size after activating developer tools in Chrome.
   
   Like in the image below, the image overflows to the right while the text is pushed to the left.
   
   <img width="448" alt="Screenshot 2022-01-21 at 5 10 33 PM" src="https://user-images.githubusercontent.com/33001894/150521291-f59fa6e0-11b3-4a0c-b938-6b8c533f3889.png">
   
   






[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791393468



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out

Review comment:
       done






[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789444758



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+Below picture depicts how it uses parity replicas in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       Looks like many of the Ozone docs follow the above approach (e.g., the OM-HA.md image refs use `![Double buffer](HA-OM-doublebuffer.png)`). I already ran mvn site and it generated fine for me.
   Is there an issue with the existing way of referencing? (This works for both IntelliJ and the site for me.)
   Please check this [mvn site generated file screenshot](https://issues.apache.org/jira/secure/attachment/13039196/mvn-site-Ozone-EC-doc-screenshot.png)






[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791345807



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability to traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6 * 3 = `18` blocks of disk space,
+but with an EC (6 data, 3 parity) deployment it will only consume `9` blocks of disk space.
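The arithmetic above can be sketched as a quick calculation (a toy model, not Ozone code; `ec_blocks` assumes one block group per `d` data blocks):

```python
def replicated_blocks(data_blocks: int, factor: int) -> int:
    """Total blocks stored when every block is replicated `factor` times."""
    return data_blocks * factor

def ec_blocks(data_blocks: int, d: int, p: int) -> int:
    """Total blocks stored under EC(d, p): every group of up to d data
    blocks carries p extra parity blocks."""
    groups = -(-data_blocks // d)  # ceiling division: number of block groups
    return data_blocks + groups * p

print(replicated_blocks(6, factor=3))  # → 18
print(ec_blocks(6, d=6, p=3))          # → 9
```

The EC layout still tolerates the loss of up to `p` blocks per group, which is what keeps durability comparable while the storage overhead falls from 200% to 50%.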
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical considerations, the most fitting data layout is the striping model.
+Data striping is not new: the striping model has already been adopted successfully by several other
+file systems (e.g., Quantcast File System, Hadoop Distributed File System).
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks are distributed to the first 6 data nodes in order,
+and then the client generates the 3 parity chunks and transfers them to the remaining 3 nodes in order.
+These 9 chunks together are called a "Stripe". The next 6 chunks are distributed to the same first 6 data nodes again,
+and their parity to the remaining 3 nodes. The blocks stored together on these 9 data nodes are called a "BlockGroup".
+
+If the application continues to write beyond the size of `6 * BLOCK_SIZE`, the client will request a new block group from the Ozone Manager.
+
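The stripe placement described above can be sketched as follows (a simplified model for illustration; the chunk labels and node indices are assumptions, not Ozone internals):

```python
def place_chunks(num_data_chunks: int, d: int, p: int) -> dict:
    """Return {node_index: [chunk labels]} for one block group.

    Data chunk i lands on node i % d within stripe i // d; every stripe
    also places one parity chunk on each of the nodes d .. d+p-1.
    """
    placement = {node: [] for node in range(d + p)}
    for i in range(num_data_chunks):
        stripe, node = divmod(i, d)
        placement[node].append(f"data-{stripe}-{i}")
    for stripe in range((num_data_chunks + d - 1) // d):
        for j in range(p):
            placement[d + j].append(f"parity-{stripe}-{j}")
    return placement

# EC(6, 3): 12 data chunks form two stripes across 9 nodes.
layout = place_chunks(12, d=6, p=3)
print(layout[0])  # node 0 holds the first data chunk of each stripe
print(layout[6])  # node 6 holds the first parity chunk of each stripe
```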
+### Erasure Coding Write
+
+The core logic of erasure coding writes is placed in the Ozone client.
+When the client creates a file, the Ozone Manager allocates a block group (`d + p`
+nodes) from the pipeline provider and returns it to the client.
+As data comes in from the application, the client writes the first `d` chunks
+to the `d` data nodes in the block group. It also caches those `d` chunks
+to generate the parity chunks. Once the parity chunks are generated, it transfers
+them to the remaining `p` nodes in order. Once all blocks reach their configured sizes,
+the client requests a new block group of nodes.
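The cache-then-encode step can be illustrated with XOR parity, the simplest codec (`p = 1`); Reed-Solomon, used for `p > 1`, needs a real codec library and is omitted from this toy sketch:

```python
def xor_parity(data_chunks: list) -> bytes:
    """Generate the single XOR parity chunk for one stripe of cached
    data chunks (all chunks must have equal length)."""
    parity = bytearray(len(data_chunks[0]))
    for chunk in data_chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

# One XOR-2-1 stripe: two cached data chunks, one generated parity chunk.
stripe = [b"\x01\x02", b"\x04\x08"]
print(xor_parity(stripe).hex())  # → 050a
```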
+
+The diagram below depicts the block allocation in containers as logical groups.
+In the interest of space, we assumed an EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom in on the blockID: 1 data layout from the above picture, shown in the following picture.
+This picture shows how the chunks are laid out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently, the EC client reuses the existing data transfer endpoints to transfer the data to data nodes,
+that is, the XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it uses the existing transfer protocols, the design has the big advantage that
+the datanode-side changes are very minimal. A data block at a data node is written
+the same as any other block in non-EC mode. In a single block group, the container id numbers
+are the same on all nodes. A file can have multiple block groups. Each block group has
+`d + p` blocks, and all of them share the same id.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
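The block-group bookkeeping described above can be modeled with a small sketch (the class name and fields are illustrative, not Ozone's actual data structures):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockGroup:
    """One EC block group: d data blocks plus p parity blocks that all
    share the same container/block id across their d + p nodes."""
    block_id: int
    container_id: int
    d: int
    p: int

    @property
    def width(self) -> int:
        # Number of nodes (and blocks) the group spans.
        return self.d + self.p

# A file striped as EC(6, 3) that spans two block groups:
groups = [BlockGroup(block_id=i, container_id=i, d=6, p=3) for i in (1, 2)]
print([g.width for g in groups])  # → [9, 9]
```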
+
+### Erasure Coding Read
+
+For reads, the OM provides the node location details as part of the key lookup.
+If the key is erasure coded, the Ozone client reads it in EC fashion. Since the data layout
+is different (the previous section discussed the layout), reads have to take the layout into account.
+
+The EC client opens connections to the DNs based on the expected locations. When all locations are available,
+it attempts plain reads chunk by chunk, in round-robin fashion, from the `d` data blocks.
+
+The picture below shows the read order when there are no failures.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until it sees a read failure, there is no need to do EC reconstruction.
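A failure-free read therefore just walks the `d` data blocks in round-robin order; a minimal sketch of that ordering:

```python
def read_order(num_chunks: int, d: int) -> list:
    """Order of (data_block, chunk_within_block) reads for a plain
    (failure-free) EC read: chunks come back in round-robin fashion."""
    return [(i % d, i // d) for i in range(num_chunks)]

# EC(3, 2): seven chunks are read from data blocks 0, 1, 2, 0, 1, 2, 0.
print(read_order(7, d=3))
```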
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When the client detects failures while reading, or when starting the reads,
+the Ozone EC client is capable of reconstructing/recovering the lost data by doing EC decoding.
+To do the EC decoding it needs to read the parity replicas. This is a degraded read, as it needs to do reconstruction.
+The reconstruction is completely transparent to applications.
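For the XOR codec the decoding is again a plain XOR: the lost chunk is the XOR of all surviving chunks in the stripe (Reed-Solomon decoding is more involved and omitted in this sketch):

```python
def reconstruct_xor(surviving_chunks: list) -> bytes:
    """Recover the single missing chunk of an XOR stripe from the
    surviving data and parity chunks."""
    lost = bytearray(len(surviving_chunks[0]))
    for chunk in surviving_chunks:
        for i, byte in enumerate(chunk):
            lost[i] ^= byte
    return bytes(lost)

# Stripe was data=[b"\x01", b"\x04"] with parity b"\x05"; data chunk 0 is lost.
print(reconstruct_xor([b"\x04", b"\x05"]).hex())  # → 01
```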
+
+The picture below depicts how parity replicas are used in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       Hi @JyotinderSingh, thank you for providing the details.
   I am not sure users will really view these docs on smaller screens. Dropping support for viewing them in GitHub and IntelliJ concerns me, since those let a reviewer view the images easily and check whether they embed into the docs properly (without needing to run the site).
   I kept all images in the 5-inch size range.
   
   If no one else has any concern, then I will update it. Thanks for checking the docs and providing your comments.

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.

Review comment:
       done

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+That is XceiverGRPC client, used for writing data and for sending putBlock info.

Review comment:
       Done

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out

Review comment:
       done

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+
+### Erasure Coding Replication Config
+
+Apache Ozone is built with pure 'Object Storage' semantics. However, many big data
+ecosystem projects still use file system APIs, so to give access to the best of both worlds,
+Ozone provides both kinds of interfaces. In both cases, keys/files are written into buckets under the hood,
+so erasure coding replication configurations can be set at the bucket level.
+The erasure coding policy encapsulates how to encode/decode a file.
+Each replication config is defined by the following pieces of information:
+ 1. **data:** The number of data blocks in an EC block group.
+ 2. **parity:** The number of parity blocks in an EC block group.
+ 3. **ecChunkSize:** The size of a striping chunk. This determines the granularity of striped reads and writes.
+ 4. **codec:** The type of erasure coding algorithm (e.g., `RS` (Reed-Solomon), `XOR`).
+
+To pass the EC replication config on the command line or in configuration files, use the following format:
+*codec*-*num data blocks*-*num parity blocks*-*ec chunk size*
+
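The format can be parsed with a few lines (a hypothetical helper for illustration, not an Ozone API):

```python
def parse_ec_config(value: str):
    """Split 'codec-data-parity-chunksize' (e.g. 'RS-6-3-1024k') into its
    four components; chunk-size suffixes k/m denote powers of 1024."""
    codec, data, parity, chunk = value.split("-")
    units = {"k": 1024, "m": 1024 ** 2}
    suffix = chunk[-1].lower()
    size = int(chunk[:-1]) * units[suffix] if suffix in units else int(chunk)
    return codec, int(data), int(parity), size

print(parse_ec_config("RS-6-3-1024k"))  # → ('RS', 6, 3, 1048576)
```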
+Currently, there are three built-in EC configs supported: `RS-3-2-1024k`, `RS-6-3-1024k`, `XOR-2-1-1024k`.

Review comment:
       I intentionally left that out because we need to test 10-4 in real clusters. Recently we saw overflow issues in HDFS that trigger mainly with larger data+parity sizes. So, until we test 10-4 thoroughly, let's skip it in the docs.

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-ec mode. In a single block group, container id numbers
+are same in all nodes. A file can have multiple block groups. Each block group will
+have `d+p` number of block and all ids are same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out
+is different(Previous section discussed the layout), reads should consider the layout and do the reads. 
+
+The EC client opens connections to the DNs based on the expected locations. When all locations are available,
+it attempts plain reads chunk by chunk, in round-robin fashion, from the d data blocks.
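The round-robin order can be sketched as follows. This is a simplified illustration under assumed names (it is not the Ozone reader API): each data node holds its column of chunks, and the reader walks stripe by stripe across nodes 0..d-1.

```python
# Sketch: reassembling a striped block group with no failures.
# read_round_robin is a hypothetical helper, not an Ozone API.

def read_round_robin(data_node_chunks, d):
    """data_node_chunks[i] is the ordered list of chunks stored on data node i."""
    out = bytearray()
    stripes = max(len(chunks) for chunks in data_node_chunks[:d])
    for stripe in range(stripes):       # stripe by stripe...
        for node in range(d):           # ...visiting node 0, 1, ..., d-1 in order
            chunks = data_node_chunks[node]
            if stripe < len(chunks):    # the last stripe may be partial
                out += chunks[stripe]
    return bytes(out)

# Chunk columns as an EC(3, 2) write of b"abcdefghi" with chunk size 1 would lay them out:
nodes = [[b"a", b"d", b"g"], [b"b", b"e", b"h"], [b"c", b"f", b"i"]]
assert read_round_robin(nodes, d=3) == b"abcdefghi"
```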
+
+The picture below shows the read order when there are no failures.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+As long as it sees no read failures, there is no need for EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When the client detects failures while reading, or when starting the reads,
+the Ozone EC client is capable of reconstructing/recovering the lost data by EC decoding.
+To do the decoding, it needs to read parity replicas. This is a degraded read, as it involves reconstruction.
+The reconstruction is completely transparent to applications.
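For the XOR-2-1 config the decoding step is especially simple to illustrate: XOR-ing the surviving data chunk with the parity chunk recovers the lost one. This is only a sketch of the idea; for the RS configs, Ozone uses Reed-Solomon decoding, which generalizes the same principle to multiple lost chunks.

```python
# Sketch: degraded read for an XOR-2-1 style stripe with one data chunk lost.
# xor_chunks is a hypothetical helper, not an Ozone API.

def xor_chunks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

c0, c1 = b"abcd", b"wxyz"          # the two data chunks of a stripe
parity = xor_chunks(c0, c1)        # written to the parity node at write time

# Degraded read: the node holding c1 is unreachable, so reconstruct it
# from the surviving data chunk and the parity chunk.
recovered_c1 = xor_chunks(c0, parity)
assert recovered_c1 == c1
```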
+
+The picture below depicts how parity replicas are used in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)
+
+### Erasure Coding Replication Config
+
+Apache Ozone is built with pure 'Object Storage' semantics. However, many big data
+ecosystem projects still use file system APIs. To offer the best of both worlds,
+Ozone provides both kinds of interfaces. In both cases, keys/files are written into buckets under the hood.
+So, erasure coding replication configurations can be set at the bucket level.
+The erasure coding policy encapsulates how to encode/decode a file.
+Each replication config is defined by the following pieces of information:
+  1. **data:** The number of data blocks in an EC block group.
+  2. **parity:** The number of parity blocks in an EC block group.
+  3. **ecChunkSize:** The size of a striping chunk. This determines the granularity of striped reads and writes.
+  4. **codec:** The type of erasure coding algorithm (e.g., `RS`(Reed-Solomon), `XOR`).
+
+To pass an EC replication config on the command line or in configuration files, use the following format:
+*codec*-*num data blocks*-*num parity blocks*-*ec chunk size*
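A small illustrative parser makes the format concrete. This is not Ozone's actual parser; the helper name, the returned field names, and the accepted size suffixes are assumptions for the sketch.

```python
# Sketch: splitting a replication config string of the form
# codec-data-parity-chunksize, e.g. "RS-6-3-1024k".

def parse_ec_config(value: str) -> dict:
    codec, data, parity, chunk = value.split("-")
    units = {"k": 1024, "m": 1024 * 1024}          # assumed suffixes
    size = int(chunk[:-1]) * units[chunk[-1].lower()]
    return {"codec": codec.upper(), "data": int(data),
            "parity": int(parity), "ecChunkSize": size}

cfg = parse_ec_config("RS-6-3-1024k")
# {'codec': 'RS', 'data': 6, 'parity': 3, 'ecChunkSize': 1048576}
```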
+
+Currently, three built-in EC replication configs are supported: `RS-3-2-1024k`, `RS-6-3-1024k`, `XOR-2-1-1024k`.
+The recommended option is `RS-6-3-1024k`. When a key/file is created without specifying a replication config,
+it inherits the EC replication config of its bucket, if available.
+
+Changing the bucket-level EC config only affects new files created within the bucket.

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791393277



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.

Review comment:
       done






[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789418341



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)

Review comment:
       Could you use change the Markdown Image tag to the Hugo shortcode defined in `hadoop-hdds/docs/themes/ozonedoc/layouts/shortcodes/image.html`
   This would ensure that the images don't overflow content boundaries on the website.
   
   You can change it to the following:
   ```
   {{< image src="EC-Write-Block-Allocation-in-Containers.png">}}
   ```
   
   Once you do this, the image won't be visible on the intellij markdown editor, but will be available on the website after the Hugo build process.
   You can preview the website as it will appear on the web by running the following ([reference](https://github.com/apache/ozone/tree/master/hadoop-hdds/docs)):
   ```
   hugo serve
   ```
   note: you will need to have hugo available on your machine to run the above command `brew install hugo`






[GitHub] [ozone] fapifta commented on pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
fapifta commented on pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#issuecomment-1020685472


   Hi @umamaheswararao,
   
   thank you for writing the documentation parts for the EC feature. I have added a couple of inline comments mainly for spelling issues, or where I have not understood well the sentence for the first read. (It might be because of my non-native english  skills, so I might not be right everywhere).
   
   In general I would like to ask you to proof read the text one more time and please take care of some inconsistencies in writing different names. What I found inconsistent is the mixing of lower/uppercase forms like:
   ec vs Ec vs EC
   erasure coding vs Erasure coding vs Erasure Coding
   replication config vs Replication config vs Replication configuration mixed with an ec prefix sometimes
   
   At some points while I was reading I really missed an article in front of some words, and sometimes I felt the one I see is not really necessary. Again this can be my non-nativeness, and you might be perfectly right with the usage or lack of the article, hence when you read the text again, please consider this, and if articles are really missed or if they are not needed then please fix it.




[GitHub] [ozone] fapifta commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
fapifta commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791254483



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.

Review comment:
       nit: beyond teh size -> beyond the size

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.

Review comment:
       nit: shouldn't this sentence start with something like one of these? :
   - The XceiverGrpc client is used for...
   - The XceiverClientGrpc client implementation is used for...
   - The gRPC Datanode client is used for...
   
   Also the next sentence for me is a bit hard to understand, but that might be just my english, so please just take a look again, besides this, it also contains a typo, "datanode sides changes" -> side

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-ec mode. In a single block group, container id numbers
+are same in all nodes. A file can have multiple block groups. Each block group will
+have `d+p` number of block and all ids are same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out

Review comment:
       nits:
   I would write something like this: "If the key is erasure coded, Ozone client reads it in EC fashion."
   "lay out" -> layout
   () -> (see the previous section about the write path)
   and do the reads. -> and do the reads accordingly.

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical considerations, the striping model was chosen as the most fitting data layout.
+Data striping is not new: it has already been adopted successfully by several other
+file systems (e.g., the Quantcast File System and the Hadoop Distributed File System).
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks are distributed to the first 6 data nodes in order;
+the client then generates the 3 parity chunks and transfers them to the remaining 3 nodes in order.
+These 9 chunks together are called a "Stripe". The next 6 chunks are distributed to the same first 6 data nodes again,
+with their parity going to the remaining 3 nodes. The blocks stored together on these 9 data nodes are called a "BlockGroup".
+
+If the application continues to write beyond the size of `6 * BLOCK_SIZE`, the client requests a new block group from the Ozone Manager.
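The striping layout described above can be modelled with a small helper (an illustrative sketch, not Ozone's actual client code):

```python
# Sketch of the striped layout: map a file's i-th data chunk to the stripe
# it belongs to and the data node (within the d+p block group) that stores
# it. The p parity chunks of each stripe then go to nodes d .. d+p-1.

def locate_chunk(file_chunk_index: int, d: int) -> tuple:
    """Return (stripe number, data-node index within the block group)."""
    stripe, node = divmod(file_chunk_index, d)
    return stripe, node

# With EC(6, 3): chunks 0..5 fill stripe 0 on nodes 0..5, chunk 6 starts
# stripe 1 back on node 0, and so on.
print(locate_chunk(6, 6))  # (1, 0)
print(locate_chunk(8, 6))  # (1, 2)
```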
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes is placed in the Ozone client.
+When the client creates a file, the Ozone Manager allocates a block group (`d + p`
+nodes) from the pipeline provider and returns it to the client.
+As data arrives from the application, the client writes the first d chunks
+to the d data nodes in the block group. It also caches these d chunks
+in order to generate the parity chunks. Once the parity chunks are generated, it transfers
+them to the remaining p nodes in order. Once all blocks reach their configured sizes,
+the client requests a new block group of nodes.
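As a toy illustration of the parity step, the simple `XOR` codec computes one parity chunk from the cached data chunks (Reed-Solomon, the more common production choice, is mathematically more involved; this sketch is not Ozone's encoder):

```python
# XOR-2-1 parity sketch: the parity chunk is the bytewise XOR of the two
# cached data chunks. If any single chunk is lost, the survivor plus the
# parity chunk recovers it -- the essence of on-the-fly reconstruction.

def xor_parity(chunk_a: bytes, chunk_b: bytes) -> bytes:
    assert len(chunk_a) == len(chunk_b), "stripe chunks are equal-sized"
    return bytes(a ^ b for a, b in zip(chunk_a, chunk_b))

d1, d2 = b"hello ", b"world!"
parity = xor_parity(d1, d2)          # written to the parity node

assert xor_parity(d2, parity) == d1  # rebuild d1 after losing its node
assert xor_parity(d1, parity) == d2  # rebuild d2 likewise
```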
+
+The diagram below depicts block allocation in containers as logical groups.
+In the interest of space, the diagram assumes an EC(3, 2) replication config.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom in on the data layout of blockID 1 from the picture above, shown in the following picture,
+which illustrates how the chunks are laid out in the data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently the EC client reuses the existing data transfer endpoints to transfer data to the data nodes,
+namely the XceiverGRPC client, used for writing data and for sending putBlock info.
+Because the existing transfer protocols are reused, the design has the big advantage that
+changes on the datanode side are minimal: a data block is written at the data nodes
+the same way as any other block in non-EC mode. Within a single block group, the container ID
+is the same on all nodes. A file can have multiple block groups. Each block group
+has `d + p` blocks, and all of them share the same ID.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, the OM provides the node location details as part of the key lookup.
+If the key is erasure-coded, the Ozone client performs the reads in EC fashion. Since the data layout
+is different (the previous section discussed the layout), the reads must take that layout into account.
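Because of the striped layout, a non-degraded reader visits the data blocks of a group in round-robin order; sketched below (illustratively, not Ozone client code):

```python
# Sketch: block index to read each successive chunk from, for EC(d, p)
# when all data blocks are healthy -- chunk 0 from block 0, chunk 1 from
# block 1, ..., wrapping around after d chunks.

def read_order(num_chunks: int, d: int) -> list:
    return [i % d for i in range(num_chunks)]

print(read_order(7, 3))  # [0, 1, 2, 0, 1, 2, 0] for an EC(3, 2) key
```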
+
+EC client will open the connections to DNs based on the expected locations. When all locations are available,

Review comment:
       nit: when all locations are -> when all data locations are
   
   if I am right, the reads do not try to connect for parity blocks unless an online recovery is required; if this is not true, then I am wrong here ;)

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+The EC client opens connections to the datanodes based on the expected locations. When all locations are available,
+it attempts plain reads, chunk by chunk, in round-robin fashion from the d data blocks.
+
+The picture below shows the read order when there are no failures while reading.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until read failures are seen, there is no need for EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When the client detects failures while reading, or when starting the reads,
+the Ozone EC client is capable of reconstructing/recovering the lost data by EC decoding.
+To do the decoding it needs to read parity replicas. This is a degraded read, as it requires reconstruction.
+The reconstruction is completely transparent to applications.
+
+The picture below depicts how parity replicas are used in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)
+
+### Erasure Coding Replication Config
+
+Apache Ozone is built with pure 'Object Storage' semantics. However, many big data
+ecosystem projects still use file system APIs. To give applications the best of both
+worlds, Ozone provides both kinds of interfaces. In both cases, keys/files are written into buckets under the hood.
+So, erasure coding replication configurations can be set at the bucket level.
+The erasure coding policy encapsulates how to encode/decode a file.
+Each replication config is defined by the following pieces of information:
+  1. **data:** Number of data blocks in an EC block group.
+  2. **parity:** Number of parity blocks in an EC block group.
+  3. **ecChunkSize:** The size of a striping chunk. This determines the granularity of striped reads and writes.
+  4. **codec:** The erasure coding algorithm (e.g., `RS` (Reed-Solomon), `XOR`).
+
+To pass the EC replication config in command line or configuration files, we need to use the following format:
+*codec*-*num data blocks*-*num parity blocks*-*ec chunk size*
+
+Currently, there are three built-in ec configs supported: `RS-3-2-1024k`, `RS-6-3-1024k`, `XOR-2-1-1024k`.

Review comment:
       Aren't we supporting RS-10-4-1024k?

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-ec mode. In a single block group, container id numbers
+are same in all nodes. A file can have multiple block groups. Each block group will
+have `d+p` number of block and all ids are same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out
+is different(Previous section discussed the layout), reads should consider the layout and do the reads. 
+
+EC client will open the connections to DNs based on the expected locations. When all locations are available,
+it will attempt to do plain reads chunk by chunk in round robin fashion from d data blocks.
+
+Below picture shows the order when there are no failures while reading.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until it sees read failures, there is no need of doing EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When client detects there are failures while reading or when starting the reads,
+Ozone EC client is capable of reconstructing/recovering the lost data by doing the ec decoding.
+To do the ec decoding it needs to read parity replicas. This is a degraded read as it needs to do reconstruction.
+This reconstruction is completely transparent to the applications.
+
+Below picture depicts how it uses parity replicas in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)
+
+ ### Erasure Coding Replication Config
+
+ Apache Ozone build with pure 'Object Storage' semantics. However, many big data
+ eco system projects still uses file system APIs. To provide both worlds best access to Ozone,
+ it's provided both faces of interfaces. In both cases, keys/files would be written into buckets under the hood.
+ So, Erasure coding replication configurations can be set at bucket level.
+ The erasure coding policy encapsulates how to encode/decode a file.
+ Each replication config is defined by the following pieces of information: 
+  1. **data:** Data blocks number in an EC block group.
+  2. **parity:** Parity blocks number in an EC block group.
+  3. **ecChunkSize:** The size of a striping chunk. This determines the granularity of striped reads and writes.
+  4. **codec:** This is to indicate the type of erasure coding algorithms (e.g., `RS`(Reed-Solomon), `XOR`).
+
+To pass the EC replication config in command line or configuration files, we need to use the following format:
+*codec*-*num data blocks*-*num parity blocks*-*ec chunk size*
+
+Currently, there are three built-in ec configs supported: `RS-3-2-1024k`, `RS-6-3-1024k`, `XOR-2-1-1024k`.
+The most recommended option is `RS-6-3-1024k`. When a key/file is created without specifying a replication config,
+it inherits the EC replication config of its bucket, if available.
+
+Changing the bucket level EC config only a affect new files create within the bucket.

Review comment:
       only a affect ->only affect
   create -> created (this change I believe also should be done at the beginning of the next line)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] fapifta commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
fapifta commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791948929



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-ec mode. In a single block group, container id numbers
+are same in all nodes. A file can have multiple block groups. Each block group will
+have `d+p` number of block and all ids are same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out
+is different(Previous section discussed the layout), reads should consider the layout and do the reads. 
+
+EC client will open the connections to DNs based on the expected locations. When all locations are available,
+it will attempt to do plain reads chunk by chunk in round robin fashion from d data blocks.
+
+Below picture shows the order when there are no failures while reading.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until it sees read failures, there is no need of doing EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When client detects there are failures while reading or when starting the reads,
+Ozone EC client is capable of reconstructing/recovering the lost data by doing the ec decoding.
+To do the ec decoding it needs to read parity replicas. This is a degraded read as it needs to do reconstruction.
+This reconstruction is completely transparent to the applications.
+
+Below picture depicts how it uses parity replicas in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       I think @umamaheswararao is right here, our doc uses this notation where I quickly checked, so I would suggest to stick to this notation for now.
   
   If we want to solve the small screen problem, let's do it in a separate JIRA. I would do it differently though, as it is way more comfortable to review these files in an IDE, and on Github, if images are shown; so we may change the notation during site build, and then use hugo to work on the md files modified during the first step. With that we can have the advantages of both notations where we need them.
   
   @JyotinderSingh what do you think? Would that be a feasible way to solve this problem?





[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789419327



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes is placed in the Ozone client.
+When the client creates a file, the Ozone Manager allocates a block group (`d + p`
+nodes) from the pipeline provider and returns it to the client.
+As data comes in from the application, the client writes the first d chunks
+to the d data nodes in the block group. It also caches those d chunks
+to generate the parity chunks. Once the parity chunks are generated, it transfers
+them to the remaining p nodes in order. Once all blocks reach their configured sizes,
+the client requests a new set of block group nodes.
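Ozone's EC uses a Reed-Solomon codec; purely to illustrate the buffer-then-generate-parity flow described above (and not the real codec), here is a sketch using a single XOR parity, which behaves like RS with p = 1:

```python
def write_stripe(chunks: list, d: int) -> dict:
    """Illustrative only: buffer d data chunks, derive one XOR parity
    chunk, and return the per-node payloads for one stripe.
    A real Ozone client uses Reed-Solomon to produce p parity chunks."""
    assert len(chunks) == d
    parity = bytes(len(chunks[0]))
    for chunk in chunks:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    payloads = {f"datanode-{i}": c for i, c in enumerate(chunks)}
    payloads[f"datanode-{d}"] = parity  # parity node
    return payloads


stripe = write_stripe([b"aa", b"bb", b"cc"], d=3)
# Losing any one chunk, XOR of the remaining three recovers it:
lost = stripe["datanode-1"]
recovered = bytes(
    x ^ y ^ z
    for x, y, z in zip(stripe["datanode-0"], stripe["datanode-2"], stripe["datanode-3"])
)
assert recovered == lost
```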
+
+The diagram below depicts the block allocation in containers as logical groups.
+In the interest of space, we assume an EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom into the blockID: 1 data layout from the picture above, shown in the following picture.
+This picture shows how the chunks are laid out in the data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently the EC client reuses the existing data transfer endpoints to transfer the data to the data nodes,
+that is, the XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it uses the existing transfer protocols, the design has the big advantage that
+the datanode-side changes are very minimal. The data block at the data nodes is written the
+same as any other block in non-EC mode. In a single block group, the container id numbers
+are the same on all nodes. A file can have multiple block groups. Each block group has
+`d + p` blocks, and all of their ids are the same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, the OM provides the node location details as part of the key lookup.
+If the key is EC, the Ozone client does the reads in EC fashion. Since the data layout
+is different (the previous section discussed the layout), reads must take the layout into account.
+
+The EC client opens connections to the DNs based on the expected locations. When all locations are available,
+it attempts plain reads chunk by chunk, in round-robin fashion, from the d data blocks.
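The round-robin read order can be sketched as follows (a hypothetical helper; the real client reads chunks over XceiverGRPC):

```python
def read_order(num_chunks: int, d: int):
    """Yield (chunk_index, data_block) pairs in the round-robin order
    an EC reader visits the d data blocks when there are no failures."""
    for i in range(num_chunks):
        yield i, i % d


# EC(6, 3): 8 chunks are read from data blocks 0..5, then 0, 1 again;
# parity blocks are only touched if a read fails.
plan = list(read_order(8, d=6))
print(plan[:3])   # [(0, 0), (1, 1), (2, 2)]
print(plan[-2:])  # [(6, 0), (7, 1)]
```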
+
+The picture below shows the read order when there are no failures while reading.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)

Review comment:
       Could you change this to use the Hugo shortcode:
   ```
   {{< image src="EC-Reads-With-No-Failures.png">}}
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789418341



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+A basic expectation of distributed systems is to provide data durability.
+To provide higher data durability, many popular storage systems use a replication
+approach, which is expensive. Apache Ozone supports the `RATIS/THREE` replication scheme.
+Ozone's default replication scheme, `RATIS/THREE`, has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activity, the additional
+block replicas are rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)

Review comment:
       Could you change the Markdown image tag to the Hugo shortcode defined in `hadoop-hdds/docs/themes/ozonedoc/layouts/shortcodes/image.html`?
   This would ensure that the images don't overflow content boundaries on the website.
   
   You can change it to the following:
   ```
   {{< image src="EC-Write-Block-Allocation-in-Containers.png">}}
   ```
   
   Once you do this, the image won't be visible on the intellij markdown editor, but will be available on the website after the Hugo build process.
   You can preview the website as it will appear on the web by running the following ([reference](https://github.com/apache/ozone/tree/master/hadoop-hdds/docs)):
   ```
   hugo serve
   ```
   note: you will need to have hugo available on your machine (install it with `brew install hugo`)






[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789419136



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+![EC Chunk Layout](EC-Chunk-Layout.png)

Review comment:
       Could you change this to use the Hugo shortcode:
   ```
   {{< image src="EC-Chunk-Layout.png">}}
   ```






[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789566482



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+Until it sees read failures, there is no need to do EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When the client detects failures while reading, or when starting the reads,
+the Ozone EC client is capable of reconstructing/recovering the lost data by doing EC decoding.
+To do the EC decoding it needs to read the parity replicas. This is a degraded read, as it needs to do reconstruction.
+This reconstruction is completely transparent to the applications.
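Assuming a maximum-distance-separable codec such as Reed-Solomon, any d healthy replicas of a block group suffice to decode. A hedged sketch (hypothetical helper, not Ozone code) of choosing which replicas a degraded read fetches:

```python
def replicas_to_read(failed: set, d: int, p: int) -> list:
    """Pick which of the d+p replicas a degraded read fetches:
    all healthy data replicas, plus one parity replica per failed
    data replica (any d healthy replicas suffice to decode)."""
    if len(failed) > p:
        raise ValueError("unrecoverable: more failures than parity")
    healthy_data = [i for i in range(d) if i not in failed]
    parity_needed = [d + j for j in range(len(failed))]
    return healthy_data + parity_needed


# EC(6, 3) with data replicas 1 and 4 unreadable: read the other
# four data replicas plus two parity replicas, then EC-decode.
print(replicas_to_read({1, 4}, d=6, p=3))  # [0, 2, 3, 5, 6, 7]
```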
+
+The picture below depicts how parity replicas are used in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       The existing approach works fine for larger screens but can cause the image to overflow out of the screen when viewing the site on smaller screens since it does not scale down the image automatically. You can try to reproduce the behavior by changing the viewport size by activating developer tools in chrome.
   
   Like in the image below, the image goes out to the right while the text is pushed to the left.
   
   <img width="448" alt="Screenshot 2022-01-21 at 5 10 33 PM" src="https://user-images.githubusercontent.com/33001894/150521291-f59fa6e0-11b3-4a0c-b938-6b8c533f3889.png">
   
   






[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791345807



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       Hi @JyotinderSingh, thank you for providing the details.
   I am not sure users will really view these docs on smaller screens. Dropping the ability to view the images in GitHub and IntelliJ concerns me, since those let a reviewer view the images easily and check whether they embed into the docs properly (without needing to run the site).
   I kept all images in the 5 inch size range.
   
   If no one else has any concern, then I will go update it. Thanks for checking the docs and providing your comments.






[GitHub] [ozone] fapifta commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
fapifta commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791254680



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The notion of a replication factor does not apply to an EC file. Instead of a replication factor,
+we introduced the ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks consumes 6 * 3 = `18` blocks of disk space.
+But with an EC (6 data, 3 parity) deployment, it consumes only `9` blocks of disk space.
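
The arithmetic above can be sketched in a few lines (an editorial illustration in Python; the function names are hypothetical and this is not Ozone code):

```python
def replicated_storage(blocks, replication_factor=3):
    """Raw blocks consumed when every block is fully replicated."""
    return blocks * replication_factor


def ec_storage(blocks, data=6, parity=3):
    """Raw blocks consumed under EC: each full group of `data` blocks
    adds `parity` extra blocks (partial final groups ignored for brevity)."""
    return blocks + (blocks // data) * parity


print(replicated_storage(6))  # 18 blocks for 3x replication
print(ec_storage(6))          # 9 blocks for EC(6, 3)
```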
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical considerations, the most fitting data layout is the striping model.
+Data striping is not new: the striping model has already been adopted successfully by several
+other file systems (e.g., the Quantcast File System and the Hadoop Distributed File System).
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks are distributed to the first 6 data nodes in order,
+and then the client generates the 3 parity chunks and transfers them to the remaining 3 nodes in order.
+Together, these 9 chunks are called a "Stripe". The next 6 chunks are distributed to the same first 6 data nodes again,
+and their parity to the remaining 3 nodes. The blocks stored on these 9 data nodes together are called a "BlockGroup".
+
+If the application continues to write beyond the size of `6 * BLOCK_SIZE`, the client requests a new block group from the Ozone Manager.
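
The round-robin placement described above can be sketched as follows (a hypothetical Python illustration of the node mapping, not Ozone code):

```python
# EC (6 data, 3 parity): data chunks cycle over the first 6 nodes;
# parity chunks of every stripe go to the last 3 nodes.
DATA, PARITY = 6, 3


def data_chunk_node(chunk_index, data=DATA):
    """Data chunk i of a block group is written to node i % data."""
    return chunk_index % data


def parity_chunk_nodes(data=DATA, parity=PARITY):
    """Parity chunks of every stripe land on the last `parity` nodes, in order."""
    return list(range(data, data + parity))


# Stripe 0: data chunks 0..5 land on nodes 0..5.
print([data_chunk_node(i) for i in range(6)])      # [0, 1, 2, 3, 4, 5]
# Stripe 1: chunks 6..11 reuse the same first 6 data nodes.
print([data_chunk_node(i) for i in range(6, 12)])  # [0, 1, 2, 3, 4, 5]
print(parity_chunk_nodes())                        # [6, 7, 8]
```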
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes is placed in the Ozone client.
+When the client creates a file, the Ozone Manager allocates a block group (`d + p`
+nodes) from the pipeline provider and returns it to the client.
+As data comes in from the application, the client writes the first `d` chunks
+to the `d` data nodes in the block group. It also caches these `d` chunks
+in order to generate the parity chunks. Once the parity chunks are generated, it transfers
+them to the remaining `p` nodes in order. Once all blocks have reached their configured sizes,
+the client requests a new block group of nodes.
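
The stripe-by-stripe write flow can be sketched as below. This is an editorial illustration only: it uses a single XOR parity chunk for brevity, whereas Ozone's codecs are Reed-Solomon and XOR with configurable parity counts, and the function names are hypothetical.

```python
from functools import reduce


def xor_parity(data_chunks):
    """Generate one parity chunk by XOR-ing equal-sized data chunks."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                        data_chunks))


def write_stripe(data_chunks, parity=1):
    """Return the full stripe: the d cached data chunks (written first),
    followed by the p parity chunks (transferred last, in order)."""
    return list(data_chunks) + [xor_parity(data_chunks)] * parity


stripe = write_stripe([b"\x01\x02", b"\x04\x08", b"\x10\x20"])
# Parity byte 0: 0x01 ^ 0x04 ^ 0x10 = 0x15; byte 1: 0x02 ^ 0x08 ^ 0x20 = 0x2a
print(stripe[-1] == b"\x15\x2a")  # True
```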
+
+The diagram below depicts the block allocation in containers as logical groups.
+In the interest of space, an EC(3, 2) replication config is assumed for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom in on the data layout of blockID: 1 from the picture above, shown in the following picture.
+This picture shows how the chunks are laid out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently, the EC client reuses the existing data transfer endpoints to transfer the data to data nodes:
+the XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it uses the existing transfer protocols, the design has the big advantage that
+the changes on the datanode side are very minimal. A data block at a data node is written
+the same as any other block in non-EC mode. In a single block group, the container ID numbers
+are the same on all nodes. A file can have multiple block groups. Each block group has
+`d + p` blocks, and all of them share the same ID.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, the OM provides the node location details as part of the key lookup.
+If the key is erasure coded, the Ozone client reads it in EC fashion. Since the data layout
+is different (the previous section discussed the layout), reads have to take the layout into account.
+
+The EC client opens connections to the DNs based on the expected locations. When all locations are available,
+it attempts plain reads chunk by chunk, in round-robin fashion, from the `d` data blocks.
+
+The picture below shows the read order when there are no failures.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until it sees read failures, there is no need to do EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When the client detects failures while reading, or when starting the reads,
+the Ozone EC client is capable of reconstructing/recovering the lost data by doing EC decoding.
+To do the EC decoding, it needs to read the parity replicas. This is a degraded read, as it needs to do reconstruction.
+The reconstruction is completely transparent to applications.
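
The decoding step can be sketched as below. Again, this is an editorial illustration: with a single XOR parity chunk (rather than Ozone's Reed-Solomon codec), one lost chunk is simply the XOR of all surviving chunks.

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR a list of equal-sized byte chunks together."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                        chunks))


def reconstruct(surviving_chunks):
    """With XOR parity, a single lost chunk equals the XOR of all
    surviving chunks (data and parity alike)."""
    return xor_chunks(surviving_chunks)


data = [b"\x01", b"\x04", b"\x10"]
parity = xor_chunks(data)  # the parity replica written at ingest time
# Suppose data[1] is lost: recover it from the other data chunks plus parity.
recovered = reconstruct([data[0], data[2], parity])
print(recovered == data[1])  # True
```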
+
+The picture below depicts how parity replicas are used in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)
+
+### Erasure Coding Replication Config
+
+Apache Ozone is built with pure 'Object Storage' semantics. However, many big data
+ecosystem projects still use file system APIs. To give Ozone the best of both worlds,
+it provides both kinds of interfaces. In both cases, keys/files are written into buckets under the hood.
+So, erasure coding replication configurations can be set at the bucket level.
+The erasure coding policy encapsulates how to encode/decode a file.
+Each replication config is defined by the following pieces of information:
+  1. **data:** The number of data blocks in an EC block group.
+  2. **parity:** The number of parity blocks in an EC block group.
+  3. **ecChunkSize:** The size of a striping chunk. This determines the granularity of striped reads and writes.
+  4. **codec:** The type of erasure coding algorithm (e.g., `RS` (Reed-Solomon), `XOR`).
+
+To pass the EC replication config on the command line or in configuration files, use the following format:
+*codec*-*num data blocks*-*num parity blocks*-*ec chunk size*
+
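For illustration, a config string decomposes into the four pieces listed above as follows (a hypothetical helper, not part of the Ozone client):

```python
def parse_ec_config(value):
    """Split an EC replication config string such as "RS-6-3-1024k"
    into codec, data block count, parity block count, and chunk size in bytes."""
    codec, data, parity, chunk = value.split("-")
    units = {"k": 1024, "m": 1024 * 1024}
    ec_chunk_size = int(chunk[:-1]) * units[chunk[-1].lower()]
    return {"codec": codec, "data": int(data), "parity": int(parity),
            "ecChunkSize": ec_chunk_size}


print(parse_ec_config("RS-6-3-1024k"))
# {'codec': 'RS', 'data': 6, 'parity': 3, 'ecChunkSize': 1048576}
```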
+Currently, three built-in EC configs are supported: `RS-3-2-1024k`, `RS-6-3-1024k`, `XOR-2-1-1024k`.

Review comment:
       Aren't we supporting RS-10-4-1024k?

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+To pass the EC replication config in command line or configuration files, we need to use the following format:
+*codec*-*num data blocks*-*num parity blocks*-*ec chunk size*
+
+Currently, there are three built-in ec configs supported: `RS-3-2-1024k`, `RS-6-3-1024k`, `XOR-2-1-1024k`.
+The most recommended option is `RS-6-3-1024k`. When a key/file created without specifying the replication config,
+it inherits the EC replication config of its bucket if it's available.
+
+Changing the bucket level EC config only a affect new files create within the bucket.

Review comment:
       only a affect ->only affect
   create -> created (this change I believe also should be done at the beginning of the next line)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789566482



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+Below picture depicts how it uses parity replicas in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       The existing approach works fine for larger screens but can cause the image to overflow out of the screen (and introduce horizontal scroll) when viewing the site on smaller screens, since it does not scale down the image automatically. You can try to reproduce the behavior by changing the viewport size after activating developer tools in Chrome.






[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r792051510



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+Below picture depicts how it uses parity replicas in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       Thanks @fapifta for chiming in and providing your views.
   @JyotinderSingh, very cool. Let's file a separate JIRA and investigate how we could replace it while building hugo. Thanks for offering the help here. 






[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791394920



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6 * 3 = `18` blocks of disk space,
+but with an EC (6 data, 3 parity) deployment, it will consume only `9` blocks of disk space.
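The overhead arithmetic above can be checked with a small sketch. This is illustrative only; the helper names are made up, and the 6-block file is the hypothetical example from the text:

```python
def replicated_cost(data_blocks, replication_factor=3):
    """Disk blocks consumed by plain replication."""
    return data_blocks * replication_factor

def ec_cost(data_blocks, d=6, p=3):
    """Disk blocks consumed by EC: every d data blocks add p parity blocks."""
    # Ceil-divide the data blocks into full/partial block groups.
    groups = -(-data_blocks // d)
    return data_blocks + groups * p

print(replicated_cost(6))  # 18 blocks with 3x replication
print(ec_cost(6))          # 9 blocks with EC (6 data, 3 parity)
```

The same functions show how the gap widens with file size: a 12-block file costs 36 replicated blocks but only 18 with EC.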
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical considerations, the most fitting data layout is the striping model.
+The data striping layout is not new. The striping model has already been adopted successfully
+by several other file systems (e.g., Quantcast File System, Hadoop Distributed File System).
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks are distributed to the first 6 data nodes in order,
+and then the client generates the 3 parity chunks and transfers them to the remaining 3 nodes in order.
+These 9 chunks together are called a "Stripe". The next 6 chunks are distributed to the same first 6 data nodes again,
+and the parity to the remaining 3 nodes. The blocks stored together on these 9 data nodes are called a "BlockGroup".
+
+If the application continues to write beyond the size of `6 * BLOCK_SIZE`, the client requests a new block group from the Ozone Manager.
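The chunk-to-node mapping described above can be sketched as follows. This is a simplified model of the layout, not Ozone's actual placement code; the function and its return shape are hypothetical:

```python
def chunk_node(chunk_index, d=6, p=3):
    """Map a data-chunk index to its slot within an EC(d, p) block group.

    Data chunks cycle over the first d nodes; every full stripe also
    carries p parity chunks on the remaining nodes. Illustrative only.
    """
    stripe = chunk_index // d              # which stripe the chunk belongs to
    node = chunk_index % d                 # data chunks go to nodes 0..d-1
    parity_nodes = list(range(d, d + p))   # parity always on nodes d..d+p-1
    return stripe, node, parity_nodes

# The first 6 chunks land on nodes 0..5 of stripe 0, the 8th chunk on
# node 1 of stripe 1; parity for each stripe goes to nodes 6, 7, 8.
print(chunk_node(0))  # (0, 0, [6, 7, 8])
print(chunk_node(7))  # (1, 1, [6, 7, 8])
```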
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes is placed in the Ozone client.
+When the client creates a file, the Ozone Manager allocates a block group (`d + p`
+nodes) from the pipeline provider and returns them to the client.
+As data comes in from the application, the client writes the first d chunks
+to the d data nodes in the block group. It also caches those d chunks
+to generate the parity chunks. Once the parity chunks are generated, it transfers
+them to the remaining p nodes in order. Once all blocks have reached their configured sizes,
+the client requests new block group nodes.
+
+The below diagram depicts the block allocation in containers as logical groups.
+In the interest of space, we assumed an EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom into the blockID: 1 data layout from the above picture, shown in the following picture.
+This picture shows how the chunks are laid out in the data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently the EC client re-uses the existing data transfer end-points to transfer the data to data nodes.
+That is, the XceiverClientGrpc client is used for writing data and for sending putBlock info.
+Since it uses the existing transfer protocols, the design has the big advantage that
+datanode-side changes are very minimal. A data block at a data node is written the
+same as any other block in non-EC mode. In a single block group, the container id numbers
+are the same on all nodes. A file can have multiple block groups. Each block group
+has `d + p` blocks, and all of their block ids are the same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
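The write flow above can be sketched for the smallest codec, XOR-2-1, where the parity is a plain XOR of the two data chunks. The `send` callback is a hypothetical stand-in for the real datanode write call, and XOR stands in for Reed-Solomon purely to keep the sketch short:

```python
def write_stripe(chunks, send):
    """Sketch of one XOR-2-1 stripe write: 2 data chunks, then 1 parity chunk.

    `send(node, chunk)` stands in for the real datanode write call (name is
    hypothetical). Parity here is a single XOR chunk, as in the XOR-2-1
    codec; for RS configs the real client computes Reed-Solomon parity.
    """
    d = 2  # data chunks per stripe in XOR-2-1
    assert len(chunks) == d
    for node, chunk in enumerate(chunks):
        send(node, chunk)                  # data chunks go to nodes 0..d-1
    # The client caches the data chunks it just wrote to encode the parity.
    parity = bytes(a ^ b for a, b in zip(chunks[0], chunks[1]))
    send(d, parity)                        # parity chunk goes to node d

sent = {}
write_stripe([b"\x01\x02", b"\x04\x08"], lambda node, c: sent.update({node: c}))
print(sent[2])  # b'\x05\n'  (0x01^0x04 = 0x05, 0x02^0x08 = 0x0a)
```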
+
+### Erasure Coding Read
+
+For reads, OM provides the node location details as part of key lookup.
+If the key is erasure coded, the Ozone client reads it in EC fashion. Since the data layout
+is different (the previous section discussed the layout), reads must take the layout into account.
+
+The EC client opens connections to the DNs based on the expected data locations. When all data locations are available,
+it attempts to do plain reads chunk by chunk, in round-robin fashion, from the d data blocks.
+
+Below picture shows the order when there are no failures while reading.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until it sees read failures, there is no need for EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When the client detects failures while reading, or when starting the reads,
+the Ozone EC client is capable of reconstructing/recovering the lost data by doing EC decoding.
+To do the EC decoding, it needs to read the parity replicas. This is a degraded read, as it needs to do reconstruction.
+This reconstruction is completely transparent to applications.
+
+The below picture depicts how parity replicas are used in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)
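For the XOR-2-1 codec, the on-the-fly reconstruction above reduces to XORing the surviving chunk with the parity chunk. This is an illustrative sketch with a made-up helper name; RS configs recover up to p lost chunks via Reed-Solomon decoding instead:

```python
def reconstruct_xor(surviving_chunk, parity_chunk):
    """Recover the lost data chunk of an XOR-2-1 stripe.

    With XOR parity, parity = data0 ^ data1, so a single lost data chunk
    is the XOR of the surviving data chunk and the parity chunk.
    Illustrative only; not Ozone's actual decoder.
    """
    return bytes(a ^ b for a, b in zip(surviving_chunk, parity_chunk))

data0, data1 = b"\x01\x02", b"\x04\x08"
parity = bytes(a ^ b for a, b in zip(data0, data1))
# Pretend data0's replica failed: rebuild it from data1 and the parity.
print(reconstruct_xor(data1, parity) == data0)  # True
```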
+
+### Erasure Coding Replication Config
+
+Apache Ozone is built with pure 'Object Storage' semantics. However, many big data
+ecosystem projects still use file system APIs. To provide the best of both worlds,
+Ozone offers both kinds of interfaces. In both cases, keys/files are written into buckets under the hood.
+So, erasure coding replication configurations can be set at the bucket level.
+The EC Replication Config encapsulates how to encode/decode a file.
+Each replication config is defined by the following pieces of information:
+  1. **data:** The number of data blocks in an EC block group.
+  2. **parity:** The number of parity blocks in an EC block group.
+  3. **ecChunkSize:** The size of a striping chunk. This determines the granularity of striped reads and writes.
+  4. **codec:** The type of erasure coding algorithm (e.g., `RS` (Reed-Solomon), `XOR`).
+
+To pass the EC replication config in command line or configuration files, we need to use the following format:
+*codec*-*num data blocks*-*num parity blocks*-*ec chunk size*
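A sketch of parsing that string format follows. The helper is hypothetical and only mirrors the documented format; Ozone's real parser lives in its Java codebase:

```python
def parse_ec_config(value):
    """Parse 'codec-data-parity-chunksize' (e.g. 'RS-6-3-1024k').

    Hypothetical helper mirroring the documented string format,
    supporting 'k' and 'm' chunk-size suffixes. Illustrative only.
    """
    codec, data, parity, chunk = value.split("-")
    units = {"k": 1024, "m": 1024 * 1024}
    size = int(chunk[:-1]) * units[chunk[-1].lower()]
    return {"codec": codec, "data": int(data), "parity": int(parity),
            "ecChunkSize": size}

print(parse_ec_config("RS-6-3-1024k"))
# {'codec': 'RS', 'data': 6, 'parity': 3, 'ecChunkSize': 1048576}
```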
+
+Currently, three built-in EC replication configs are supported: `RS-3-2-1024k`, `RS-6-3-1024k`, `XOR-2-1-1024k`.
+The recommended option is `RS-6-3-1024k`. When a key/file is created without specifying a replication config,
+it inherits the EC replication config of its bucket, if available.
+
+Changing the bucket-level EC Replication Config only affects new keys/files created within the bucket.

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] fapifta commented on pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
fapifta commented on pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#issuecomment-1020685472


   Hi @umamaheswararao,
   
   thank you for writing the documentation parts for the EC feature. I have added a couple of inline comments mainly for spelling issues, or where I have not understood well the sentence for the first read. (It might be because of my non-native english  skills, so I might not be right everywhere).
   
   In general I would like to ask you to proof read the text one more time and please take care of some inconsistencies in writing different names. What I found inconsistent is the mixing of lower/uppercase forms like:
   ec vs Ec vs EC
   erasure coding vs Erasure coding vs Erasure Coding
   replication config vs Replication config vs Replication configuration mixed with an ec prefix sometimes
   
   At some points while I was reading I really missed an article in front of some words, and sometimes I felt the one I see is not really necessary. Again this can be my non-nativeness, and you might be perfectly right with the usage or lack of the article, hence when you read the text again, please consider this, and if articles are really missed or if they are not needed then please fix it.




[GitHub] [ozone] umamaheswararao commented on pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#issuecomment-1021540299


   @fapifta I totally agree that there are lot of places I used different formats as you mentioned. I did not really focused on that EC and Erasure Coding or EC Replication Config or EC replication configuration(I meant to say it's erasure coding configuration, but indirectly it's bringing inconsistencies for readers probably) is creating some in consistencies. As a review reader, I feel your views are correct than my biased view :-)
   I changed most of the placed to be in consistent now. I still used Replication Config in place to tell that an interface.
   
   Thanks for the reviews. Yeah we can continue improve as we need few mores updates to this docs as we may need to update few other configs.




[GitHub] [ozone] fapifta commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
fapifta commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791254483



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.

Review comment:
       nit: beyond teh size -> beyond the size

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.

Review comment:
       nit: shouldn't this sentence start with something like one of these? :
   - The XceiverGrpc client is used for...
   - The XceiverClientGrpc client implementation is used for...
   - The gRPC Datanode client is used for...
   
   Also the next sentence for me is a bit hard to understand, but that might be just my english, so please just take a look again, besides this, it also contains a typo, "datanode sides changes" -> side

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-ec mode. In a single block group, container id numbers
+are same in all nodes. A file can have multiple block groups. Each block group will
+have `d+p` number of block and all ids are same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out

Review comment:
       nits:
   I would write something like this: "If the key is erasure coded, Ozone client reads it in EC fashion."
   "lay out" -> layout
   () -> (see the previous section about the write path)
   and do the reads. -> and do the reads accordingly.

##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-ec mode. In a single block group, container id numbers
+are same in all nodes. A file can have multiple block groups. Each block group will
+have `d+p` number of block and all ids are same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out
+is different(Previous section discussed the layout), reads should consider the layout and do the reads. 
+
+EC client will open the connections to DNs based on the expected locations. When all locations are available,

Review comment:
       nit: when all locations are -> when all data locations are
   
   if I am right, and the reads does not try to connect for parity blocks just in case an online recovery is required, if this is not true, then I am wrong here ;)






[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789418341



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)

Review comment:
       Could you change the Markdown image tag to the Hugo shortcode defined in `hadoop-hdds/docs/themes/ozonedoc/layouts/shortcodes/image.html`?
   This would ensure that the images don't overflow content boundaries on the website.
   
   You can change it to the following:
   ```
   {{< image src="EC-Write-Block-Allocation-in-Containers.png">}}
   ```
   
   Once you do this, the image won't be visible on the intellij markdown editor, but will be available on the website after the Hugo build process.
   You can preview the website as it will appear on the web by running the following ([reference](https://github.com/apache/ozone/tree/master/hadoop-hdds/docs)):
   ```
   hugo serve
   ```






[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789419737



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-ec mode. In a single block group, container id numbers
+are same in all nodes. A file can have multiple block groups. Each block group will
+have `d+p` number of block and all ids are same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out
+is different(Previous section discussed the layout), reads should consider the layout and do the reads. 
+
+EC client will open the connections to DNs based on the expected locations. When all locations are available,
+it will attempt to do plain reads chunk by chunk in round robin fashion from d data blocks.
+
+Below picture shows the order when there are no failures while reading.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until it sees read failures, there is no need of doing EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When client detects there are failures while reading or when starting the reads,
+Ozone EC client is capable of reconstructing/recovering the lost data by doing the ec decoding.
+To do the ec decoding it needs to read parity replicas. This is a degraded read as it needs to do reconstruction.
+This reconstruction is completely transparent to the applications.
+
+Below picture depicts how it uses parity replicas in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       Could you change this to use the Hugo shortcode:
   ```
   {{< image src="EC-Reconstructional-Read.png">}}
   ```
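
       For reference, a shortcode of this kind is usually a tiny Hugo template that caps the image width; hypothetically (this is a sketch, not the actual contents of `image.html`):
   ```
   {{/* hypothetical responsive image shortcode */}}
   <img src="{{ .Get "src" }}" style="max-width: 100%;">
   ```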






[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789566482



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+Below picture depicts how it uses parity replicas in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       The existing approach works fine for larger screens but can cause the image to overflow the screen (introducing horizontal scroll) on smaller screens, since it does not scale the image down automatically. You can reproduce the behavior by changing the viewport size after activating developer tools in Chrome.






[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r792041220



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+Below picture depicts how it uses parity replicas in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       I agree with this reasoning @umamaheswararao @fapifta.
   Ease of reviewing is definitely a benefit and @fapifta's suggestion of changing it during build time looks like a good solution to this problem.
   I'll take this forward as another jira.






[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r791394502



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond teh size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently the EC client reuses the existing data transfer end-points to transfer the data to data nodes,
+that is, the XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it uses the existing transfer protocols, the design has the big advantage that
+the datanode-side changes are very minimal. A data block at a data node is written
+the same as any other block in non-EC mode. In a single block group, the container id numbers
+are the same on all nodes. A file can have multiple block groups. Each block group has
+`d + p` blocks, and all of their ids are the same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, the OM provides the node location details as part of key lookup.
+If the key is erasure coded, the Ozone client does the reads in EC fashion. Since the data layout
+is different (the previous section discussed the layout), reads must take the layout into account.
+
+The EC client opens connections to the DNs based on the expected locations. When all locations are available,
+it attempts to do plain reads chunk by chunk, in round-robin fashion, from the d data blocks.
+
+The picture below shows the read order when there are no failures while reading.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until it sees read failures, there is no need to do EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When the client detects failures while reading, or when starting the reads,
+the Ozone EC client is capable of reconstructing/recovering the lost data by doing EC decoding.
+To do the EC decoding it needs to read the parity replicas. This is a degraded read, as it needs to do reconstruction.
+This reconstruction is completely transparent to applications.
+
+The picture below depicts how parity replicas are used in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)
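As a toy illustration of how parity enables this kind of reconstruction, here is a Python sketch using the simple `XOR` codec (Reed-Solomon generalizes the same idea to tolerate multiple failures); this is not Ozone's decoder:

```python
# Toy XOR(2,1) reconstruction: parity = d1 XOR d2, so a lost data chunk
# can be recovered as parity XOR surviving_chunk. Illustrative only,
# not Ozone code.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d1 = b"chunk-one-data!!"
d2 = b"chunk-two-data!!"
parity = xor_bytes(d1, d2)  # written to the parity node

# Suppose the node holding d1 fails; reconstruct it from d2 and parity.
recovered = xor_bytes(parity, d2)
assert recovered == d1
```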
+
+### Erasure Coding Replication Config
+
+Apache Ozone is built with pure 'Object Storage' semantics. However, many big data
+ecosystem projects still use file system APIs. To provide the best of both worlds,
+Ozone provides both kinds of interfaces. In both cases, keys/files are written into buckets under the hood.
+So, erasure coding replication configurations can be set at the bucket level.
+The erasure coding replication config encapsulates how to encode/decode a file.
+Each replication config is defined by the following pieces of information:
+  1. **data:** The number of data blocks in an EC block group.
+  2. **parity:** The number of parity blocks in an EC block group.
+  3. **ecChunkSize:** The size of a striping chunk. This determines the granularity of striped reads and writes.
+  4. **codec:** The type of erasure coding algorithm (e.g., `RS` (Reed-Solomon), `XOR`).
+
+To pass the EC replication config on the command line or in configuration files, use the following format:
+*codec*-*num data blocks*-*num parity blocks*-*ec chunk size*
+
+Currently, there are three built-in EC replication configs supported: `RS-3-2-1024k`, `RS-6-3-1024k`, `XOR-2-1-1024k`.
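For illustration, a replication config string in this format could be parsed as follows. This is a hypothetical Python sketch (the function name and return shape are made up), not Ozone's actual parser:

```python
# Hypothetical parser for the codec-data-parity-chunksize format,
# e.g. "RS-6-3-1024k". Not Ozone's actual parser.

def parse_ec_config(s):
    codec, data, parity, chunk = s.split("-")
    units = {"k": 1024, "m": 1024 * 1024}
    size = int(chunk[:-1]) * units[chunk[-1].lower()]
    return {"codec": codec, "data": int(data),
            "parity": int(parity), "ecChunkSize": size}

cfg = parse_ec_config("RS-6-3-1024k")
assert cfg == {"codec": "RS", "data": 6, "parity": 3, "ecChunkSize": 1048576}
```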

Review comment:
       I intentionally left that out because we need to test 10-4 in real clusters. Recently we saw overflow issues in HDFS which trigger mainly with larger data+parity sizes. So, until we test 10-4 thoroughly, let's skip it in the docs. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] umamaheswararao merged pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao merged pull request #3006:
URL: https://github.com/apache/ozone/pull/3006


   




[GitHub] [ozone] umamaheswararao edited a comment on pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao edited a comment on pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#issuecomment-1021540299


   @fapifta I totally agree that there are a lot of places where I used different formats, as you mentioned. I did not really focus on how mixing EC, Erasure Coding, EC Replication Config, and EC replication configuration (I meant erasure coding configuration, but it probably creates inconsistencies for readers) comes across. As a reviewer, I feel your view is more correct than my biased one :-)
   I have changed most of the places to be consistent now. I still used Replication Config in places to denote the interface.
   
   Thanks for the reviews. Yeah, we can continue to improve, as we need a few more updates to these docs when we update a few other configs.




[GitHub] [ozone] umamaheswararao commented on pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#issuecomment-1018235645


   Updated the initial version of the doc. Please review!




[GitHub] [ozone] JyotinderSingh commented on pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#issuecomment-1018257469


   Thank you for the docs @umamaheswararao! I have added a few minor formatting-related comments.




[GitHub] [ozone] JyotinderSingh commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
JyotinderSingh commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789418341



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+A basic expectation of distributed systems is to provide data durability.
+To provide higher data durability, many popular storage systems use a replication
+approach, which is expensive. Apache Ozone supports the `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activity, the additional
+block replicas are rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of a replication factor,
+we introduced the ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond the size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)

Review comment:
       Could you change the Markdown image tag to the Hugo shortcode defined in `hadoop-hdds/docs/themes/ozonedoc/layouts/shortcodes/image.html`?
   This would ensure that the images don't overflow the content boundaries on the website.
   
   You can change it to the following:
   ```
   {{< image src="EC-Write-Block-Allocation-in-Containers.png">}}
   ```
   
   Once you do this, the image won't be visible on the intellij markdown editor, but will be available on the website after the Hugo build process.
   You can preview the website as it will appear on the web by running the following ([reference](https://github.com/apache/ozone/tree/master/hadoop-hdds/docs)):
   ```
   hugo serve
   ```
   note: you will need to have hugo available on your machine to run the serve command `brew install hugo`







[GitHub] [ozone] umamaheswararao commented on a change in pull request #3006: HDDS-6172: EC: Document the Ozone EC

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on a change in pull request #3006:
URL: https://github.com/apache/ozone/pull/3006#discussion_r789444758



##########
File path: hadoop-hdds/docs/content/feature/ErasureCoding.md
##########
@@ -0,0 +1,215 @@
+---
+title: "Ozone Erasure Coding"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: Erasure Coding Support for Ozone.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+## Background
+
+Distributed systems basic expectation is to provide the data durability.
+To provide the higher data durability, many popular storage systems use replication
+approach which is expensive. The Apache Ozone supports `RATIS/THREE` replication scheme.
+The Ozone default replication scheme `RATIS/THREE` has 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activities, additional
+block replicas rarely accessed during normal operations, but still consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault-tolerance with much less storage space.
+In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
+The replication factor of an EC file is meaningless. Instead of replication factor,
+we introduced ReplicationConfig interface to specify the required type of replication,
+either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone can improve storage efficiency while still providing similar
+data durability as traditional replication-based Ozone deployments.
+As an example, a 3x replicated file with 6 blocks will consume 6*3 = `18` blocks of disk space.
+But with EC (6 data, 3 parity) deployment, it will only consume `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical consideration, the most fitting data layout is striping model.
+The data striping layout is not new. The striping model already adapted by several other
+file systems(Ex: Quantcast File System, Hadoop Distributed File System etc) successfully before.
+
+For example, with the EC (6 data, 3 parity) scheme, the data chunks will be distributed to first 6 data nodes in order
+and then client generates the 3 parity chunks and transfer to remaining 3 nodes in order.
+These 9 chunks together we call as "Stripe". Next 6 chunks will be distributed to the same first 6 data nodes again
+and the parity to remaining 3 nodes. These 9 data nodes stored blocks together called as "BlockGroup".
+
+If the application is continuing to write beyond the size of `6 * BLOCK_SIZE`, then client will request new block group from Ozone Manager.
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes are placed at ozone client.
+When client creates the file, ozone manager allocates the block group(`d + p`)
+number of nodes from the pipeline provider and return the same to client.
+As data is coming in from the application, client will write first d number of chunks
+to d number of data nodes in block group. It will also cache the d number chunks
+to generate the parity chunks. Once parity chunks generated, it will transfer the
+same to the remaining p nodes in order. Once all blocks reached their configured sizes,
+client will request the new block group nodes.
+
+Below diagram depicts the block allocation in containers as logical groups.
+For interest of space, we assumed EC(3, 2) replication config for the diagram.
+
+![EC Block Allocation in Containers](EC-Write-Block-Allocation-in-Containers.png)
+
+
+Let's zoom out the blockID: 1 data layout from the above picture, that showed in the following picture.
+This picture shows how the chunks will be layed out in data node blocks.
+![EC Chunk Layout](EC-Chunk-Layout.png)
+
+Currently ec client re-used the data transfer end-points to transfer the data to data nodes.
+That is XceiverGRPC client, used for writing data and for sending putBlock info.
+Since it used the existing transfer protocols while transferring the data, design got big advantage that,
+datanode sides changes are very minimal. The data block at data nodes would be written
+same as any other block in non-ec mode. In a single block group, container id numbers
+are same in all nodes. A file can have multiple block groups. Each block group will
+have `d+p` number of block and all ids are same.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, OM will provide the node location details as part of key lookup.
+If the key is in EC, Ozone client will do the reads in EC fashion. Since the data lay out
+is different(Previous section discussed the layout), reads should consider the layout and do the reads. 
+
+EC client will open the connections to DNs based on the expected locations. When all locations are available,
+it will attempt to do plain reads chunk by chunk in round robin fashion from d data blocks.
+
+Below picture shows the order when there are no failures while reading.
+![EC Reads With no Failures](EC-Reads-With-No-Failures.png)
+
+Until it sees read failures, there is no need of doing EC reconstruction.
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When client detects there are failures while reading or when starting the reads,
+Ozone EC client is capable of reconstructing/recovering the lost data by doing the ec decoding.
+To do the ec decoding it needs to read parity replicas. This is a degraded read as it needs to do reconstruction.
+This reconstruction is completely transparent to the applications.
+
+Below picture depicts how it uses parity replicas in reconstruction.
+![EC Reconstructional Reads](EC-Reconstructional-Read.png)

Review comment:
       Looks like many of the Ozone docs followed the above approach. I already ran mvn site and it generated fine for me.
   Is there an issue with the existing way of referencing? (This works for both IntelliJ and the site for me.)
   Please check this [mvn site generated file screenshot](https://issues.apache.org/jira/secure/attachment/13039196/mvn-site-Ozone-EC-doc-screenshot.png)



