Posted to common-dev@hadoop.apache.org by Chaitanya M V S <ch...@gmail.com> on 2018/06/11 16:51:38 UTC

Regarding Hadoop Erasure Coding architecture

Hi!

We are a group of people trying to understand the architecture of erasure
coding in Hadoop 3.0. We have been facing difficulties understanding a few
terms and concepts related to it.

1. What do the terms Block, Block Group, Stripe, Cell and Chunk mean in the
context of erasure coding (these terms have taken on different meanings and
have been used interchangeably across various documentation and blogs)? How
has this been incorporated into the reading and writing of EC data?

2. How has the idea/concept of a block from previous versions been carried
over to EC?

3. The higher-level APIs, those of ErasureCoders and ErasureCodec, still
haven't been plugged into Hadoop. Also, I haven't found any new JIRA
regarding them. Could you let me know if there are any updates or pointers
regarding the incorporation of these APIs into Hadoop?

4. How is the datanode for reconstruction work chosen?  Also, how are the
buffer sizes for the reconstruction work determined?


Thanks in advance for your time and consideration.

Regards,
M.V.S.Chaitanya

Re: Regarding Hadoop Erasure Coding architecture

Posted by Xiao Chen <xi...@cloudera.com.INVALID>.
Hi M.V.S.Chaitanya,

Thanks for the interest!

In case you didn't find it, the upstream doc
<http://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html>
has the definitions. This blog post
<https://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/>
may also help clarify things a bit.

Some answers inline.

Best,
-Xiao


On Mon, Jun 11, 2018 at 9:52 AM Chaitanya M V S <ch...@gmail.com>
wrote:

> Hi!
>
> We are a group of people trying to understand the architecture of erasure
> coding in Hadoop 3.0. We have been facing difficulties understanding a few
> terms and concepts related to it.
>
> 1. What do the terms Block, Block Group, Stripe, Cell and Chunk mean in the
> context of erasure coding (these terms have taken on different meanings and
> have been used interchangeably across various documentation and blogs)? How
> has this been incorporated into the reading and writing of EC data?

Checking the source code is probably the best way to get answers to questions
like how the reading and writing of EC data is done.

>
> 2. How has the idea/concept of a block from previous versions been carried
> over to EC?
>
A block is still largely an actual file on a datanode. In EC, a block group
contains several blocks (9 in the case of RS(6,3): 6 data plus 3 parity).
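
To make that concrete, here is a toy sketch of the striping arithmetic: a
file is split into cells that are distributed round-robin across the data
blocks of a group. The class and constants below are made up for
illustration (mirroring the RS(6,3) policy with a 1 MB cell), not HDFS API:

// Illustration only: where a logical file offset lands inside an EC block
// group under RS(6,3) with a 1 MB cell. Names are hypothetical, not HDFS API.
public class StripeMath {
  static final int DATA_BLOCKS = 6;           // data units in RS(6,3)
  static final long CELL_SIZE = 1024 * 1024;  // 1 MB striping cell

  public static void main(String[] args) {
    long fileOffset = 10L * 1024 * 1024 + 123;    // arbitrary logical offset

    long cellIdx = fileOffset / CELL_SIZE;        // which cell of the file
    long stripeIdx = cellIdx / DATA_BLOCKS;       // which stripe of the group
    int blockIdx = (int) (cellIdx % DATA_BLOCKS); // which data block holds it
    long offsetInBlock = stripeIdx * CELL_SIZE + fileOffset % CELL_SIZE;

    System.out.printf("offset %d -> cell %d, stripe %d, block #%d, " +
        "offset %d within that block%n",
        fileOffset, cellIdx, stripeIdx, blockIdx, offsetInBlock);
  }
}

The 3 parity blocks of the group then hold one parity cell per stripe,
computed from that stripe's 6 data cells.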

>
> 3. The higher-level APIs, those of ErasureCoders and ErasureCodec, still
> haven't been plugged into Hadoop. Also, I haven't found any new JIRA
> regarding them. Could you let me know if there are any updates or pointers
> regarding the incorporation of these APIs into Hadoop?
>
I'm not sure which APIs are being referred to here. A sample pointer to the
Hadoop implementation is
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/erasurecode/codec/ErasureCodec.java
and more can be looked up. :)
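
If you mean the raw coder layer, here is a minimal encode sketch, assuming
the org.apache.hadoop.io.erasurecode classes as of Hadoop 3.x (exact
signatures may differ between releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.erasurecode.CodecUtil;
import org.apache.hadoop.io.erasurecode.ErasureCodeConstants;
import org.apache.hadoop.io.erasurecode.ErasureCoderOptions;
import org.apache.hadoop.io.erasurecode.rawcoder.RawErasureEncoder;

public class RawEncoderDemo {
  public static void main(String[] args) throws Exception {
    // RS(6,3): 6 data units, 3 parity units per stripe.
    ErasureCoderOptions options = new ErasureCoderOptions(6, 3);
    RawErasureEncoder encoder = CodecUtil.createRawEncoder(
        new Configuration(), ErasureCodeConstants.RS_CODEC_NAME, options);

    int cellSize = 64 * 1024;                 // toy cell size for the demo
    byte[][] data = new byte[6][cellSize];    // 6 data cells (zero-filled)
    byte[][] parity = new byte[3][cellSize];  // 3 parity cells to compute

    encoder.encode(data, parity);             // fills the parity buffers
    System.out.println("computed " + parity.length + " parity cells");
  }
}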

>
> 4. How is the datanode for reconstruction work chosen?  Also, how are the
> buffer sizes for the reconstruction work determined?
>
I'd suggest looking at the source code in the NameNode, specifically the
BlockManager class.
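
At the coder level, reconstruction is just a decode: the surviving cells of
a stripe go in, the erased ones come out. A minimal sketch, again assuming
the raw coder API as of Hadoop 3.x; the DataNode-side orchestration
(source/target selection, buffer sizing) lives in classes such as
ErasureCodingWorker:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.erasurecode.CodecUtil;
import org.apache.hadoop.io.erasurecode.ErasureCodeConstants;
import org.apache.hadoop.io.erasurecode.ErasureCoderOptions;
import org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder;

public class RawDecoderDemo {
  public static void main(String[] args) throws Exception {
    ErasureCoderOptions options = new ErasureCoderOptions(6, 3);
    RawErasureDecoder decoder = CodecUtil.createRawDecoder(
        new Configuration(), ErasureCodeConstants.RS_CODEC_NAME, options);

    int cellSize = 64 * 1024;
    // inputs holds all 9 units of one stripe (6 data + 3 parity);
    // null marks the erased units that need to be reconstructed.
    byte[][] inputs = new byte[9][];
    for (int i = 0; i < 9; i++) {
      inputs[i] = new byte[cellSize];  // surviving cells (zero-filled toy data)
    }
    int[] erasedIndexes = {2, 7};      // say data unit 2 and parity unit 1 died
    inputs[2] = null;
    inputs[7] = null;

    byte[][] outputs = new byte[erasedIndexes.length][cellSize];
    decoder.decode(inputs, erasedIndexes, outputs);  // recomputes lost cells
    System.out.println("reconstructed " + outputs.length + " cells");
  }
}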

>
>
> Thanks in advance for your time and consideration.
>
> Regards,
> M.V.S.Chaitanya
>