Posted to dev@bookkeeper.apache.org by "Ivan Kelly (JIRA)" <ji...@apache.org> on 2012/05/09 17:15:48 UTC

[jira] [Created] (BOOKKEEPER-248) Rereplicating of under replicated data

Ivan Kelly created BOOKKEEPER-248:
-------------------------------------

             Summary: Rereplicating of under replicated data
                 Key: BOOKKEEPER-248
                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
             Project: Bookkeeper
          Issue Type: Sub-task
            Reporter: Ivan Kelly




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293696#comment-13293696 ] 

Ivan Kelly commented on BOOKKEEPER-248:
---------------------------------------

From discussions on other JIRAs, the code here should implement a recovery worker thread. The thread loop should be something like this:

{code}
while (true) {
    l = selectLedgerToRecover();
    if (l != null) {
        List<LedgerFragment> fragments = LedgerChecker.checkLedger(l);
        for (LedgerFragment lf : fragments) {
            rereplicateFragment(lf);
        }
    }
    waitForUnderreplicatedLedgers();
}
{code}
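
A minimal, self-contained Java sketch of the loop above, for illustration only. It assumes a BlockingQueue takes the place of selectLedgerToRecover() and waitForUnderreplicatedLedgers(), and that a plain function returning fragment names takes the place of LedgerChecker.checkLedger(). The class RecoveryWorkerSketch and all of its members are hypothetical and are not part of the BookKeeper code base.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.LongFunction;

/** Sketch of the recovery worker loop. Every collaborator here is a hypothetical stand-in. */
final class RecoveryWorkerSketch implements Runnable {
    // Assumption: stands in for selectLedgerToRecover() plus waitForUnderreplicatedLedgers().
    private final BlockingQueue<Long> underreplicatedLedgers;
    // Assumption: stands in for LedgerChecker.checkLedger(l); fragments are plain strings here.
    private final LongFunction<List<String>> ledgerChecker;
    private final List<String> rereplicated = new CopyOnWriteArrayList<String>();

    RecoveryWorkerSketch(BlockingQueue<Long> underreplicatedLedgers,
                         LongFunction<List<String>> ledgerChecker) {
        this.underreplicatedLedgers = underreplicatedLedgers;
        this.ledgerChecker = ledgerChecker;
    }

    /** Checks one ledger and records each fragment as re-replicated; returns the fragment count. */
    int recoverLedger(long ledgerId) {
        List<String> fragments = ledgerChecker.apply(ledgerId);
        for (String fragment : fragments) {
            rereplicated.add(fragment); // placeholder for rereplicateFragment(lf)
        }
        return fragments.size();
    }

    List<String> rereplicatedFragments() {
        return rereplicated;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // take() blocks until some ledger is reported as underreplicated.
                recoverLedger(underreplicatedLedgers.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and leave the loop
        }
    }

    public static void main(String[] args) {
        RecoveryWorkerSketch worker = new RecoveryWorkerSketch(
                new LinkedBlockingQueue<Long>(),
                id -> Arrays.asList("ledger-" + id + "/fragment-0"));
        System.out.println(worker.recoverLedger(42L) + " fragment(s) re-replicated");
    }
}
```

Blocking on the queue merges the "select" and "wait" steps of the pseudocode into one call, so the thread does not spin while nothing is underreplicated.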
                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Updated] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated BOOKKEEPER-248:
-------------------------------------------

    Attachment: BOOKKEEPER-248.patch
    
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch, BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438651#comment-13438651 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-248:
------------------------------------------------

Thanks a lot Ivan for the review!

{quote}
In ReplicationWorker#run, don't catch a Throwable. This will catch exceptions which you really don't want to catch, specifically, RuntimeExceptions like NullPointerException or ArithmeticException which are a result of programming error. 
{quote}
Yes, Ivan. I added a TODO inside the Throwable catch block to keep this point in mind.
I will now try to catch each exception specifically, and we can discuss after that.

{quote}
The flow is a bit strange between #run() and #doReplicateFragments(). You're interacting with underreplicationManager in both, depending on booleans etc. I think it would be clearer to put all the interaction with underreplicationManager in run(). The boolean return from doReplicateFragments seems to be designed especially for this.
{quote}
In fact, I had only just extracted that code from the run method into doReplicateFragments :-).
I did that mainly with the pending replications monitor in mind.
After replication of a ledger has been delayed and the timeout has expired, I wanted to reuse this method from there to replicate.

Anyway, I will inline it for now. Once I move on to that JIRA, we can discuss it there.

I will look at addressing the other comments.

Thanks a lot.

+Uma

                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439794#comment-13439794 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-248:
------------------------------------------------

Attached a new revision of the patch which addresses the comments.

Found 2 issues while testing. One I explained in my previous comment; the other is an issue with GC (raised as a separate JIRA, BK-376).
Regarding the comment below:
{quote}
ReplicationWorker#stop() should wait for #run() to have finished before cleaning up the bk client. Otherwise you're asking for null pointer exceptions to happen. Have a look at guava Service [1]. It may be worth using.
{quote}
Actually, I am initializing the required objects only in the ReplicationWorker constructor, and all of them are lightweight. I did not see the possibility of a null pointer exception there.
In fact, introducing the guava Service would be a good idea, but given my limited familiarity with it and the limited time, I could not introduce it now. If you don't mind, I will switch to it once I am more familiar with it, and will raise a small task for that later.

Could you please take a look at the patch?



                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch, BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396648#comment-13396648 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-248:
------------------------------------------------

Open question:
 How can the ReplicationWorker know the ledger DigestType and password when reading the entries as part of replication?

                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Updated] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Kelly updated BOOKKEEPER-248:
----------------------------------

    Description: This subtask discusses how we will rereplicate underreplicated entries.
    
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Assigned] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G reassigned BOOKKEEPER-248:
----------------------------------------------

    Assignee: Uma Maheswara Rao G
    
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293699#comment-13293699 ] 

Ivan Kelly commented on BOOKKEEPER-248:
---------------------------------------

Selecting the ledger to recover should lock the znode, so that other workers do not try to use it. Once recovery has finished, the underreplicated znode for the ledger should be deleted.
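
The lock-then-delete protocol can be sketched in memory as follows. This is an assumption-laden illustration, not ZooKeeper code: a ConcurrentHashMap models the lock znodes (putIfAbsent plays the role of a create that fails when the node already exists), and a concurrent set models the underreplicated znodes. The class UnderreplicationLockSketch and its method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory model of "lock the znode, then delete it when finished". Not ZooKeeper code. */
final class UnderreplicationLockSketch {
    // Assumption: models the set of underreplicated-ledger znodes.
    private final Set<Long> underreplicated = ConcurrentHashMap.newKeySet();
    // Assumption: models lock znodes; the value is the id of the worker holding the lock.
    private final ConcurrentMap<Long, String> locks = new ConcurrentHashMap<Long, String>();

    void markUnderreplicated(long ledgerId) {
        underreplicated.add(ledgerId);
    }

    /** Returns a ledger now locked by workerId, or -1 if nothing is available. */
    long selectLedgerToRecover(String workerId) {
        for (Long ledgerId : underreplicated) {
            if (locks.putIfAbsent(ledgerId, workerId) == null) {
                if (underreplicated.contains(ledgerId)) {
                    return ledgerId; // lock acquired and the ledger still needs recovery
                }
                locks.remove(ledgerId, workerId); // recovered by someone else in the meantime
            }
        }
        return -1L;
    }

    /** Called after successful re-replication: delete the marker, then drop the lock. */
    void markReplicated(long ledgerId, String workerId) {
        if (!workerId.equals(locks.get(ledgerId))) {
            throw new IllegalStateException("lock for ledger " + ledgerId + " not held by " + workerId);
        }
        underreplicated.remove(ledgerId);
        locks.remove(ledgerId, workerId);
    }

    /** Called when the worker gives up: the ledger stays marked so another worker can take it. */
    void releaseLock(long ledgerId, String workerId) {
        locks.remove(ledgerId, workerId);
    }

    public static void main(String[] args) {
        UnderreplicationLockSketch manager = new UnderreplicationLockSketch();
        manager.markUnderreplicated(5L);
        System.out.println("worker-a got ledger " + manager.selectLedgerToRecover("worker-a"));
        System.out.println("worker-b got ledger " + manager.selectLedgerToRecover("worker-b"));
    }
}
```

In this model the marker is deleted before the lock is dropped, so no other worker can pick up a ledger that has just been repaired.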
                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440178#comment-13440178 ] 

Ivan Kelly commented on BOOKKEEPER-248:
---------------------------------------

For completeness, could you add a sync BookKeeperAdmin#openLedger? It should be almost a copy/paste. Otherwise the patch is good.

Regarding Service, I'm fine with leaving that until later. We also need to think about how we run the threads. For example, it would be nice for the threads to be able to restart if they crash, but only after a delay and only for a limited number of times (to stop a persistent failure filling everything with logs). I think HBase does something like this now, so I will look and see.
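
One possible shape for the "restart after a delay, but only a limited number of times" idea, as a sketch. It assumes a crash shows up as a RuntimeException escaping the task; the class RestartingRunnerSketch and the parameters maxRestarts and delayMillis are hypothetical. It does not reflect how HBase or BookKeeper actually implement this.

```java
import java.util.concurrent.TimeUnit;

/** Sketch of a supervisor that restarts a crashed task a bounded number of times. */
final class RestartingRunnerSketch {

    /**
     * Runs the task; if it crashes, waits delayMillis and runs it again, at most
     * maxRestarts times. Returns true if a run finished normally, false if the
     * restart budget was exhausted or the supervisor was interrupted.
     */
    static boolean runWithRestarts(Runnable task, int maxRestarts, long delayMillis) {
        for (int attempt = 0; attempt <= maxRestarts; attempt++) {
            try {
                task.run();
                return true; // clean exit, nothing to restart
            } catch (RuntimeException e) {
                // One log line per crash; the restart bound keeps the log volume finite.
                System.err.println("worker crashed (run " + (attempt + 1) + "): " + e);
            }
            if (attempt < maxRestarts) {
                try {
                    TimeUnit.MILLISECONDS.sleep(delayMillis); // back off before restarting
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false; // persistent failure: give up instead of looping forever
    }

    public static void main(String[] args) {
        boolean finished = runWithRestarts(
                () -> { throw new IllegalStateException("simulated crash"); }, 2, 10L);
        System.out.println("finished normally: " + finished);
    }
}
```

With maxRestarts set to 2 the task runs at most three times in total: the initial run plus two restarts.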
                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch, BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Updated] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated BOOKKEEPER-248:
-------------------------------------------

    Attachment: BOOKKEEPER-248.patch
    
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Updated] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated BOOKKEEPER-248:
-------------------------------------------

    Attachment: BOOKKEEPER-248.patch
    
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch, BOOKKEEPER-248.patch, BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440225#comment-13440225 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-248:
------------------------------------------------

Addressed the comments from Ivan.
@Ivan, could you please take a look? I also raised a separate JIRA (BK-378) and attached a patch for the case discussed above, where the multiple-workers test can block due to a watcher issue in ZkLedgerUnderreplicationManager.
                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch, BOOKKEEPER-248.patch, BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440168#comment-13440168 ] 

Ivan Kelly commented on BOOKKEEPER-248:
---------------------------------------

{quote}
Shall I handle that bug along with this JIRA?
{quote}
This should be in another JIRA. I think the fix is quite simple, as you say (add watcher again by calling exists again after NodeExistsException).

The NPE I was speaking of may not actually exist, but the way it is implemented it looks very possible. In #stop you close the underreplLM and bkc. This uninitializes both these objects. However, to stop #run, you only set workerRunning to false, which allows #run to keep going until the next iteration of the while loop, possibly using the underreplLM and bkc which you have just uninitialized. So any issue may not be an NPE exactly, but it could well be some problem of the same type. Really, what you need to do here is have #stop() block until #run() has finished, and then clean up the bkc etc. A CountDownLatch would do it. Alternatively, ReplicationWorker could own the Thread object which is being used to run it [1], and then #stop() could join() the thread to wait for it to finish.

[1] Don't start the thread from the ctor; have an explicit start() method. Starting from the ctor makes unit testing a pain, and causes findbugs issues.
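
A sketch of the join() variant described above, under stated assumptions: the worker owns its Thread, start() is explicit, and stop() waits for run() to return before releasing resources. The class StoppableWorkerSketch is hypothetical, and the resourcesOpen flag merely stands in for the real underreplLM and bkc objects.

```java
/** Sketch of a worker whose stop() waits for run() before releasing resources. */
final class StoppableWorkerSketch implements Runnable {
    private volatile boolean running;
    private volatile boolean resourcesOpen = true; // assumption: stands in for bkc / underreplLM
    private volatile boolean usedAfterClose;       // set if run() ever sees closed resources
    private Thread thread;

    /** Explicit start(): the constructor never starts a thread. */
    synchronized void start() {
        if (thread != null) {
            throw new IllegalStateException("already started");
        }
        running = true;
        thread = new Thread(this, "replication-worker-sketch");
        thread.start();
    }

    @Override
    public void run() {
        while (running) {
            if (!resourcesOpen) {
                usedAfterClose = true; // the bug that stop() is meant to rule out
            }
            try {
                Thread.sleep(5); // placeholder for one unit of replication work
            } catch (InterruptedException e) {
                return; // stop() interrupted us: leave the loop promptly
            }
        }
    }

    /** Signals run() to finish, waits for it, and only then releases the resources. */
    synchronized boolean stop() {
        running = false;
        if (thread != null) {
            thread.interrupt();
            try {
                thread.join(); // block until run() has returned
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // could not confirm run() finished, so leave resources open
            }
        }
        resourcesOpen = false; // safe: run() can no longer be executing
        return true;
    }

    synchronized boolean isAlive() {
        return thread != null && thread.isAlive();
    }

    boolean usedResourcesAfterClose() {
        return usedAfterClose;
    }

    public static void main(String[] args) {
        StoppableWorkerSketch worker = new StoppableWorkerSketch();
        worker.start();
        System.out.println("stopped cleanly: " + worker.stop());
    }
}
```

Because the resources are released only after join() returns, run() can never observe them in a closed state. A CountDownLatch counted down at the end of run() would give the same guarantee.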


                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch, BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396657#comment-13396657 ] 

Ivan Kelly commented on BOOKKEEPER-248:
---------------------------------------

{quote}
How can the ReplicationWorker know the ledger DigestType and password when reading the entries as part of replication?
{quote}
BOOKKEEPER-2. I have most of the code for this done, but I need to write tests. 
                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396693#comment-13396693 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-248:
------------------------------------------------

Ok, thanks Ivan. Will make use of BK-2 metadata info once it is in.
                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396038#comment-13396038 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-248:
------------------------------------------------

I have worked on this patch.

ReplicationWorker is a thread which starts the UnderreplicateWatcher to get the underreplicated ledgers and populate them into underreplicatedLedgersQ.

The ReplicationWorker loop polls for elements from this queue. Once it picks a ledger from the queue, it gets the fragments from LedgerChecker. After that it tries to acquire the lock. On success, it replicates those fragments using LedgerFragmentReplicator (updated in BOOKKEEPER-299). On successful replication, it clears the lock and deletes the underreplicated ledger node.

We have chosen the current node as the target node. If the current node is part of the fragment ensemble, the worker just skips the ledger and releases the lock, giving other bookies a chance to copy those fragments. It then proceeds to pick another ledger.
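
The target-selection rule in the last paragraph can be illustrated with a small sketch. It assumes bookies are identified by plain strings and that a ledger is skipped as soon as any one of its fragments already lists the local bookie in its ensemble. That all-or-nothing reading is an assumption, and LocalTargetSketch with its Decision enum is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of the "replicate to the local bookie unless it is already in the ensemble" rule. */
final class LocalTargetSketch {
    enum Decision { REPLICATE_TO_LOCAL, SKIP_AND_RELEASE_LOCK }

    /**
     * fragmentEnsembles holds one bookie list per underreplicated fragment.
     * Assumption: one fragment that already includes the local bookie skips the whole ledger.
     */
    static Decision decide(String localBookie, List<List<String>> fragmentEnsembles) {
        for (List<String> ensemble : fragmentEnsembles) {
            if (ensemble.contains(localBookie)) {
                // The local bookie cannot hold a second copy; leave it to other bookies.
                return Decision.SKIP_AND_RELEASE_LOCK;
            }
        }
        return Decision.REPLICATE_TO_LOCAL;
    }

    public static void main(String[] args) {
        List<List<String>> ensembles =
                Arrays.asList(Arrays.asList("bookie-1:3181", "bookie-2:3181"));
        System.out.println(decide("bookie-3:3181", ensembles));
        System.out.println(decide("bookie-1:3181", ensembles));
    }
}
```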

I will upload the distributed lock patch and the ReplicationWorker patch separately...
 
                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438597#comment-13438597 ] 

Ivan Kelly commented on BOOKKEEPER-248:
---------------------------------------

@Uma I had a look. Shape is good. Comments are as follows.

The new BookKeeperAdmin constructor should only take a BookKeeper. The zookeeper client and bookies available path can be extracted from this.

In ReplicationWorker#run, don't catch a Throwable. This will catch exceptions which you really don't want to catch, specifically, RuntimeExceptions like NullPointerException or ArithmeticException which are a result of programming error. 

In ReplicationWorker#run, when you do catch an exception, you should return from the run method. You don't want the while loop to run again. (you should also call #stop() to cleanup).

ReplicationWorker#stop() should wait for #run() to have finished before cleaning up the bk client. Otherwise you're asking for null pointer exceptions to happen. Have a look at guava Service [1]. It may be worth using.

A quick note on #doReplicateFragments would be good to explain what the boolean return value is.

The flow is a bit strange between #run() and #doReplicateFragments(). You're interacting with underreplicationManager in both, depending on booleans etc. I think it would be clearer to put all the interaction with underreplicationManager in run(). The boolean return from doReplicateFragments seems to be designed especially for this.

I'm not sure if I like the fact that you only replicate to a single bookie from an individual replication worker. I can see why you did it (so you'll only replicate to local), but it seems to me as if it could cause problems later, though I can't pin down any specific reason right now.

Why not implement a sync openLedgerNoRecovery on BookKeeperAdmin rather than in #getLedgerHandle?

Also, this needs more tests. Such as multiple running workers and multiple bookie failures and ledger failures etc.

[1] http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/util/concurrent/Service.html
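
The two points above about exception handling in ReplicationWorker#run can be sketched as follows. The sketch catches only a specific checked exception, leaves the loop when that happens, always cleans up, and lets RuntimeExceptions from programming errors propagate. FailFastLoopSketch, ReplicationException and Step are hypothetical names, and the cleanedUp flag stands in for a call to #stop().

```java
/** Sketch of a run loop that catches only expected failures and never loops after one. */
final class FailFastLoopSketch {
    /** Hypothetical checked exception for an expected replication failure. */
    static final class ReplicationException extends Exception {
        ReplicationException(String message) {
            super(message);
        }
    }

    /** Hypothetical unit of work performed once per loop iteration. */
    interface Step {
        void execute() throws ReplicationException;
    }

    private int iterations;
    private boolean cleanedUp;

    void run(Step step, int maxIterations) {
        try {
            while (iterations < maxIterations) {
                iterations++;
                step.execute();
            }
        } catch (ReplicationException e) {
            // Expected failure: report it and fall out of the loop instead of iterating again.
            System.err.println("stopping worker: " + e.getMessage());
        } finally {
            cleanedUp = true; // placeholder for stop(); also runs when a RuntimeException escapes
        }
        // No catch for Throwable or RuntimeException: programming errors surface to the caller.
    }

    int iterations() {
        return iterations;
    }

    boolean cleanedUp() {
        return cleanedUp;
    }

    public static void main(String[] args) {
        FailFastLoopSketch loop = new FailFastLoopSketch();
        loop.run(() -> { throw new ReplicationException("simulated bookie failure"); }, 10);
        System.out.println("iterations=" + loop.iterations() + " cleanedUp=" + loop.cleanedUp());
    }
}
```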

                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440186#comment-13440186 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-248:
------------------------------------------------

Thanks a lot, Ivan. I will add BookKeeperAdmin#openLedger in the next revision in a few minutes.


{quote}
Regarding Service, I'm fine with leaving that until later. We also need to think about how we run the threads. For example, it would be nice for the threads to be able to restart if they crash, but only after a delay and only for a limited number of times (to stop a persistent failure filling everything with logs). I think HBase does something like this now, so I will look and see.
{quote}
Thanks. I will also invest some time in that area.
                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch, BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437599#comment-13437599 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-248:
------------------------------------------------

Attached an initial patch. I still need to add some more tests and recheck some of the TODOs in it.
Ready for approach review!
                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Updated] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated BOOKKEEPER-248:
-------------------------------------------

    Component/s:     (was: bookkeeper-server)
                     (was: bookkeeper-client)
                 bookkeeper-auto-recovery
    
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch, BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.


        

[jira] [Commented] (BOOKKEEPER-248) Rereplicating of under replicated data

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439426#comment-13439426 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-248:
------------------------------------------------

While writing a test with multiple recovery workers, I found an issue in the ZKLedgerUnderreplicationManager#getLedgerToRereplicateFromHierarchy API.

The issue is:
1) Two workers start and try to get the lock for the same ledger.
2) Both workers find that the lock node does not exist.
3) Both go ahead and try to create the lock node.
4) One worker fails with a NodeExists exception.

The failed worker then just removes that child from its list and waits on the latch for the watch notification.

Unfortunately, the watch on lockPath was added by the exists check call, at a time when lockPath did not actually exist. So the watch may not be valid, and the worker will never get the notification when the lock is cleaned up by the other worker.
If the other worker replicated only part of the ledger, the current worker should now take the lock. But it cannot get that notification, because it added the watch when the node did not exist.

Shall I handle that bug as part of this JIRA?

A possible solution is to add the watcher again on KeeperException.NodeExistsException, right?
Or we could simply handle NodeCreated in the watcher as well and notify, letting the worker try again (I have not thought through many scenarios with this option).
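
The race above can be replayed with a small in-memory model. This is only a sketch under stated assumptions: FakeStore is a hypothetical stand-in for ZooKeeper, not its API, and it models one-shot watches that fire on either creation or deletion of the watched path. It also assumes the waiting worker's watcher reacts only to deletion, which is one plausible reading of the problem described here. The flag selects the first proposed fix, setting the watch again after the NodeExists failure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

public class LockWatchSketch {
    enum Event { NODE_CREATED, NODE_DELETED }

    /** Stands in for KeeperException.NodeExistsException in this model. */
    static class NodeExists extends RuntimeException {}

    /** Hypothetical in-memory node store with one-shot watches. */
    static class FakeStore {
        private final Set<String> nodes = new HashSet<>();
        private final Map<String, List<Consumer<Event>>> watches = new HashMap<>();

        /** Reports whether the node exists and arms a one-shot watch on the path. */
        boolean exists(String path, Consumer<Event> watcher) {
            watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
            return nodes.contains(path);
        }

        void create(String path) {
            if (!nodes.add(path)) {
                throw new NodeExists();
            }
            fire(path, Event.NODE_CREATED);
        }

        void delete(String path) {
            if (nodes.remove(path)) {
                fire(path, Event.NODE_DELETED);
            }
        }

        // Watches are removed before delivery, so each one fires at most once.
        private void fire(String path, Event event) {
            List<Consumer<Event>> armed = watches.remove(path);
            if (armed != null) {
                for (Consumer<Event> w : armed) {
                    w.accept(event);
                }
            }
        }
    }

    /** Replays the race; returns true if the losing worker learns the lock was released. */
    static boolean loserIsNotified(boolean rewatchOnNodeExists) {
        FakeStore store = new FakeStore();
        String lockPath = "/locks/ledger-1"; // hypothetical path
        boolean[] notified = {false};
        // Assumption: the watcher only wakes the worker on deletion.
        Consumer<Event> watcher = e -> {
            if (e == Event.NODE_DELETED) {
                notified[0] = true;
            }
        };

        store.exists(lockPath, watcher); // losing worker: lock absent, watch armed
        store.create(lockPath);          // winning worker creates the lock; the watch fires with NODE_CREATED
        try {
            store.create(lockPath);      // losing worker's create fails
        } catch (NodeExists ex) {
            if (rewatchOnNodeExists) {
                store.exists(lockPath, watcher); // proposed fix: arm the watch again
            }
        }
        store.delete(lockPath);          // winning worker releases the lock
        return notified[0];
    }

    public static void main(String[] args) {
        System.out.println("without re-watch: " + loserIsNotified(false));
        System.out.println("with re-watch: " + loserIsNotified(true));
    }
}
```

In this model the losing worker is notified only when the watch is set again after the failed create. The second option, handling NodeCreated in the watcher and retrying, is not modelled here.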


                
> Rereplicating of under replicated data
> --------------------------------------
>
>                 Key: BOOKKEEPER-248
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-248
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-client, bookkeeper-server
>            Reporter: Ivan Kelly
>            Assignee: Uma Maheswara Rao G
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-248.patch
>
>
> This subtask discusses how we will rereplicate underreplicated entries.
