Posted to mapreduce-issues@hadoop.apache.org by "dhruba borthakur (Created) (JIRA)" <ji...@apache.org> on 2011/11/05 19:22:53 UTC

[jira] [Created] (MAPREDUCE-3361) Ability to use SimpleRegeratingCode to fix missing blocks

Ability to use SimpleRegeratingCode to fix missing blocks
---------------------------------------------------------

                 Key: MAPREDUCE-3361
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3361
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: contrib/raid
            Reporter: dhruba borthakur
            Assignee: dhruba borthakur


Reed-Solomon encoding (n, k) has n storage nodes and can tolerate n-k failures. Regenerating a block requires reading k blocks. This is a problem when n and k are large. Instead, we can use simple regenerating codes (n, k, f), which first apply Reed-Solomon (n, k) and then XOR with stripe size f. A single disk failure then needs to access only f nodes, and f can be very small.
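
To make the repair-cost argument concrete, here is a minimal Python sketch of the XOR step only. It assumes a group of f blocks protected by a single XOR parity, so a lost block is rebuilt from the other f - 1 blocks plus the parity, f reads in total. The Reed-Solomon step is left out, and the function names are hypothetical; this is not contrib/raid code.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of a non-empty list of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def repair_block(group, parity, lost_index):
    """Rebuild group[lost_index] from the surviving blocks plus the XOR parity.

    Returns the rebuilt block and the number of blocks that had to be read.
    """
    survivors = [b for i, b in enumerate(group) if i != lost_index]
    reads = survivors + [parity]
    return xor_blocks(reads), len(reads)


# Hypothetical group of f = 3 blocks; tiny byte strings stand in for real blocks.
group = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(group)
rebuilt, blocks_read = repair_block(group, parity, lost_index=1)
```

The number of reads depends only on f, not on n or k, which is the property the description relies on.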

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3361) Ability to use SimpleRegeratingCode to fix missing blocks

Posted by "Scott Chen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Chen updated MAPREDUCE-3361:
----------------------------------

    Assignee: Weiyan Wang  (was: dhruba borthakur)

Assigning to Weiyan because he is working on this now.
                


[jira] [Commented] (MAPREDUCE-3361) Ability to use SimpleRegeratingCode to fix missing blocks

Posted by "Ramkumar Vadali (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160969#comment-13160969 ] 

Ramkumar Vadali commented on MAPREDUCE-3361:
--------------------------------------------

It would also be nice for this code to be backwards compatible with existing Reed-Solomon parity files. The code can identify an existing Reed-Solomon parity file by comparing its number of parity blocks with the expected number of Reed-Solomon parity blocks. This is doable because the additional XOR parity blocks increase the total number of parity blocks by a deterministic amount. Thus the code will be able to handle existing Reed-Solomon parity files and will generate new files with additional XOR blocks.
                


[jira] [Commented] (MAPREDUCE-3361) Ability to use SimpleRegeratingCode to fix missing blocks

Posted by "Scott Chen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145682#comment-13145682 ] 

Scott Chen commented on MAPREDUCE-3361:
---------------------------------------

I think one thing we need to do is refactor raid.Encoder and raid.Decoder so that they are generic to any ErasureCode. That way we can easily add different codes to Raid.
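
One possible shape for that refactoring, sketched in Python rather than the Java used by contrib/raid. The ErasureCode interface, its encode/decode signatures, and the XorCode class are hypothetical illustrations, not the existing API; the point is only that Encoder and Decoder hold a code object and never branch on its type.

```python
from abc import ABC, abstractmethod


class ErasureCode(ABC):
    """Hypothetical code-agnostic interface (not the actual contrib/raid API)."""

    @abstractmethod
    def encode(self, data_blocks):
        """Return the list of parity blocks for one stripe of data blocks."""

    @abstractmethod
    def decode(self, blocks, parities):
        """Return the data blocks with any None entries reconstructed."""


class XorCode(ErasureCode):
    """Single-parity XOR code: repairs at most one missing block per stripe."""

    @staticmethod
    def _xor(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    def encode(self, data_blocks):
        return [self._xor(data_blocks)]

    def decode(self, blocks, parities):
        missing = [i for i, b in enumerate(blocks) if b is None]
        if not missing:
            return list(blocks)
        if len(missing) > 1:
            raise ValueError("XOR can repair only one missing block")
        present = [b for b in blocks if b is not None]
        repaired = list(blocks)
        repaired[missing[0]] = self._xor(present + [parities[0]])
        return repaired


class Encoder:
    """Generic encoder: delegates all code-specific work to the ErasureCode."""

    def __init__(self, code):
        self.code = code

    def encode_stripe(self, data_blocks):
        return self.code.encode(data_blocks)


class Decoder:
    """Generic decoder: delegates all code-specific work to the ErasureCode."""

    def __init__(self, code):
        self.code = code

    def fix_stripe(self, blocks, parities):
        return self.code.decode(blocks, parities)
```

With this layout, adding Reed-Solomon or SRC would mean adding another ErasureCode subclass, leaving Encoder and Decoder unchanged.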
                


[jira] [Commented] (MAPREDUCE-3361) Ability to use SimpleRegeratingCode to fix missing blocks

Posted by "Alex Dimakis (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162247#comment-13162247 ] 

Alex Dimakis commented on MAPREDUCE-3361:
-----------------------------------------

After some discussion, it seems that the easiest way to ensure backwards compatibility is to recognize whether this is a Reed-Solomon coded packet or an SRC coded packet by the size of the parity file. If we use stripe size 10, 4 RS packets and 2 simple XORs, then the parity file should have size 64MB*6, while for RS coded files it should be 64MB*4. We will implement this distinction in Processfile and run different decoder functions accordingly.
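
The size check can be sketched as follows in Python. The constants come from the numbers in this comment (64MB blocks, stripe size 10, 4 RS parities, 2 XOR parities). The function names are hypothetical, and the sketch assumes every stripe, including a partial last one, gets a full set of whole parity blocks; the real layout may differ.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB
STRIPE_LENGTH = 10             # data blocks per stripe
RS_PARITIES = 4                # RS parity blocks per stripe
XOR_PARITIES = 2               # extra simple XOR blocks per stripe under SRC


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def expected_parity_size(source_size, parities_per_stripe):
    """Parity file size in bytes, assuming whole parity blocks for every stripe."""
    source_blocks = ceil_div(source_size, BLOCK_SIZE)
    stripes = ceil_div(source_blocks, STRIPE_LENGTH)
    return stripes * parities_per_stripe * BLOCK_SIZE


def classify_parity(source_size, parity_size):
    """Return "SRC" or "RS" depending on which expected size the parity file matches."""
    if parity_size == expected_parity_size(source_size, RS_PARITIES + XOR_PARITIES):
        return "SRC"
    if parity_size == expected_parity_size(source_size, RS_PARITIES):
        return "RS"
    raise ValueError("parity size matches neither the RS nor the SRC layout")


# One full stripe of source data: 64MB*6 of parity means SRC, 64MB*4 means RS.
one_stripe = STRIPE_LENGTH * BLOCK_SIZE
kind_six = classify_parity(one_stripe, 6 * BLOCK_SIZE)
kind_four = classify_parity(one_stripe, 4 * BLOCK_SIZE)
```

The check only works while the two layouts produce different parity counts per stripe.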

                


[jira] [Commented] (MAPREDUCE-3361) Ability to use SimpleRegeratingCode to fix missing blocks

Posted by "Maheswaran Sathiamoorthy (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171463#comment-13171463 ] 

Maheswaran Sathiamoorthy commented on MAPREDUCE-3361:
-----------------------------------------------------

There is another way of doing it:
I will add a new erasure code type called SRC to ErasureCodeType (which has XOR and RS now) and start storing SRC coded files in /raidsrc (RS files are stored in /raidrs, XOR in /raid). When a file corruption is detected and recoverBlockToFile is called, the first thing to do is to check whether the file is a parity file or a source file. Its location easily tells us whether it is a parity file and, if so, which type. If it is not a parity file, then it is a source file and we need to determine its corresponding parity file. This can be done by checking for a parity file first in /raidsrc, and then in /raidrs and /raid.
The same thing can be done by checking the file size, but for that we still need to search for the parity file by going to /raidrs or /raid, so I think the above approach is a little bit cleaner.
For reconstructing the file, in either approach, we need to pass the ErasureCodeType all the way down to the decoder and encoder.
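
A rough Python sketch of the directory-based lookup described above. The three directories and the lookup order are taken from this comment; everything else is an assumption made for illustration: that a parity file lives at the parity directory prefix followed by the source path, the function names, and the `exists` callable standing in for a file system check.

```python
from enum import Enum


class ErasureCodeType(Enum):
    """Code types mapped to the parity directories named in the comment."""
    XOR = "/raid"
    RS = "/raidrs"
    SRC = "/raidsrc"


# Lookup order from the comment: /raidsrc first, then /raidrs, then /raid.
LOOKUP_ORDER = [ErasureCodeType.SRC, ErasureCodeType.RS, ErasureCodeType.XOR]


def parity_type_of(path):
    """Return the code type if path lies under a parity directory, else None."""
    for code in LOOKUP_ORDER:
        if path.startswith(code.value + "/"):
            return code
    return None


def find_parity(source_path, exists):
    """Locate the parity file of a source file.

    Assumes (hypothetically) that the parity file path is the parity
    directory prefix followed by the source path. `exists` is any callable
    that reports whether a path is present.
    Returns (code type, parity path), or None if no parity file is found.
    """
    for code in LOOKUP_ORDER:
        candidate = code.value + source_path
        if exists(candidate):
            return code, candidate
    return None


# Usage with a set of paths standing in for the file system.
files = {"/user/a/f1", "/raidrs/user/a/f1"}
found = find_parity("/user/a/f1", files.__contains__)
```

The code type returned here is what would then be passed down to the encoder and decoder.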
                


[jira] [Commented] (MAPREDUCE-3361) Ability to use SimpleRegeratingCode to fix missing blocks

Posted by "Scott Chen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145652#comment-13145652 ] 

Scott Chen commented on MAPREDUCE-3361:
---------------------------------------

Here is the paper on Simple Regenerating Codes.
http://arxiv.org/pdf/1109.0264
                


[jira] [Commented] (MAPREDUCE-3361) Ability to use SimpleRegeratingCode to fix missing blocks

Posted by "Alex Dimakis (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161300#comment-13161300 ] 

Alex Dimakis commented on MAPREDUCE-3361:
-----------------------------------------

I can see that backwards compatibility would be crucial for a deployed system. It is not always clear how to tell whether a parity block is a simple parity or an RS parity just by counting, since the config files might specify a different number of simple parities (our default kept the total number of parities at 4, with two RS parities and two degree-6 XORs, to keep the same storage overhead as a (14,10) Reed-Solomon code).

I think a cleaner way to tell what each parity is would be through the metadata file or the folder it is in (how do you currently distinguish simple XOR parities from RS parities?).
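
A small worked example of why counting alone can fail, in Python. It assumes both layouts use stripes of 10 data blocks, which is my reading of "(14,10)"; the config names and dictionary layout are hypothetical.

```python
from fractions import Fraction

STRIPE_LENGTH = 10  # data blocks per stripe (assumed for both layouts)

# Parity blocks per stripe for the two layouts discussed in this comment.
configs = {
    "RS (14,10)": {"rs": 4, "xor": 0},
    "SRC default": {"rs": 2, "xor": 2},
}


def total_parities(cfg):
    """Number of parity blocks per stripe, regardless of their kind."""
    return cfg["rs"] + cfg["xor"]


def storage_overhead(cfg):
    """Stored blocks divided by data blocks, as an exact fraction."""
    return Fraction(STRIPE_LENGTH + total_parities(cfg), STRIPE_LENGTH)


for name, cfg in configs.items():
    print(name, total_parities(cfg), storage_overhead(cfg))
```

Both layouts come out at 4 parity blocks per stripe and an overhead of 14/10, so neither the parity count nor the parity file size separates them.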

                
