Posted to dev@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2010/02/12 20:58:28 UTC

[jira] Created: (HBASE-2223) Handle 10min+ network partitions between clusters

Handle 10min+ network partitions between clusters
-------------------------------------------------

                 Key: HBASE-2223
                 URL: https://issues.apache.org/jira/browse/HBASE-2223
             Project: Hadoop HBase
          Issue Type: Sub-task
            Reporter: Jean-Daniel Cryans
            Assignee: Jean-Daniel Cryans




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833161#action_12833161 ] 

Jean-Daniel Cryans commented on HBASE-2223:
-------------------------------------------

That's my thinking too, so the first version of this tool will probably just help the HBase administrator make the right choices.

> Handle 10min+ network partitions between clusters
> -------------------------------------------------
>
>                 Key: HBASE-2223
>                 URL: https://issues.apache.org/jira/browse/HBASE-2223
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.21.0
>
>
> We need a nice way of handling long network partitions without impacting a master cluster (which pushes the data). Currently it will just retry over and over again.
> I think we could:
>  - Stop replication to a slave cluster if it didn't respond for more than 10 minutes
>  - Keep track of the duration of the partition
>  - When the slave cluster comes back, initiate a MR job like HBASE-2221 
> Maybe we want less than 10 minutes, maybe we want this to be all automatic or just the first 2 parts. Discuss.


[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833227#action_12833227 ] 

ryan rawson commented on HBASE-2223:
------------------------------------

So if we lose a replication stream, it is an uneven break: at that point we can't say 'we are missing all edits from TS=X to TS=Y' and kick off a map-reduce job to read them over.

The central question is: do we want to avoid duplicate KeyValues as much as possible? I say yes, because duplicates mess with the version checking and are in general sloppy.

Also, edits don't pile up that quickly on mainline serving systems, so in reality we aren't talking about a 50TB log storage requirement.

We should probably be tracking the status of all logfiles in zookeeper so we know who needs what and when.
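
A minimal sketch of what "who needs what" could look like, assuming one set of pending peers per log file. The class and method names are hypothetical, and the in-memory map only stands in for state that would be kept in ZooKeeper:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for per-logfile replication state in ZooKeeper.
class LogRetentionTracker {
    // log file name -> peers that have not yet fully replicated it
    private final Map<String, Set<String>> pendingPeers = new HashMap<>();

    // Record a new log file and the peers that still need its edits.
    void registerLog(String logName, Set<String> peers) {
        pendingPeers.put(logName, new TreeSet<>(peers));
    }

    // Called once a peer has received every edit in the given log.
    void markReplicated(String logName, String peer) {
        Set<String> peers = pendingPeers.get(logName);
        if (peers != null) {
            peers.remove(peer);
        }
    }

    // A log is deletable once no peer needs it (unknown logs count as unneeded).
    boolean canDelete(String logName) {
        Set<String> peers = pendingPeers.get(logName);
        return peers == null || peers.isEmpty();
    }

    // Defensive copy of the peers still waiting on this log.
    Set<String> stillNeededBy(String logName) {
        Set<String> peers = pendingPeers.get(logName);
        return peers == null ? new TreeSet<String>() : new TreeSet<>(peers);
    }
}
```

With this shape, log cleanup would ask canDelete() before removing a file, and a status page could list stillNeededBy() for each log.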


[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833193#action_12833193 ] 

ryan rawson commented on HBASE-2223:
------------------------------------

It might be better not to have a mixed view of the world during the catch-up period. This could cause problems for applications that need to assume a single arrow of time for edits and do not want to see a partial world view.

With multiPut, catching up should be quite speedy...


[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833209#action_12833209 ] 

Jean-Daniel Cryans commented on HBASE-2223:
-------------------------------------------

{quote}
the previous plan was to NOT delete the log files if replication needed them... Sounds like you changed that?
If replication needs a log file and the server crashed and its logs are set to be split, we need to NOT delete the log file; perhaps move it to a holding area and then get someone to pick it up and send it along.
{quote}

That's HBASE-2070.

{quote}
A 2 hour outage isn't that much; I'd say that we should buffer logs until someone decides it's not worth the disk space. I.e., make it a top-level admin action/alert and give the administrator an option to drop the log retention and then do alternative catch-up later.

It would be better to hold on to a few TB of replication logs and then replay them after 24-48 hours of downtime than to mess with the map-reduce stuff, since you'd have to be careful to avoid duplicating KeyValues.
{quote}

Yes, that's how I see we could do it if we don't do the MR path.


[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835532#action_12835532 ] 

Jean-Daniel Cryans commented on HBASE-2223:
-------------------------------------------

Some design notes:

We need another class to manage multiple ReplicationSources (going to many slave clusters) between ReplicationHLog and ReplicationSource; let's call it ReplicationSourceManager (RSM) for the moment. That class should be responsible for taking actions and keeping tabs on each outbound stream. When a source has successfully sent a batch of edits to a peer, it should report the latest HLogKey to the RSM so that we can match it to an HLog file (using the writeTime) and then publish that in ZooKeeper for each slave cluster.
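
A rough sketch of the writeTime-to-HLog matching, under the assumption that each HLog can be keyed by the time it was opened and covers edits until the next log starts. LogPositionIndex and its methods are made-up names, not existing HBase classes:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index the RSM could keep to map a reported write time to a log file.
class LogPositionIndex {
    // start time of each log in ms -> log file name (assumes roll times are known)
    private final TreeMap<Long, String> logsByStartTime = new TreeMap<>();

    void addLog(long startTimeMs, String logName) {
        logsByStartTime.put(startTimeMs, logName);
    }

    // Returns the log expected to contain an edit written at writeTimeMs,
    // or null if the edit predates every known log.
    String logFor(long writeTimeMs) {
        Map.Entry<Long, String> entry = logsByStartTime.floorEntry(writeTimeMs);
        return entry == null ? null : entry.getValue();
    }
}
```

The name returned by logFor() is what would be published per slave cluster; every log older than it is fully replicated for that peer.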

We could detect that a peer is unreachable if the ReplicationSource hasn't reported after X time (configurable; not sure what the default should be). Here I'm still wondering what would be the best way to detect that a peer cluster is back... retrying connections to the peer ZK quorum? We also need to handle the case where the cluster is simply shut down (using the shutdown znode). At that point we stop queuing entries for that source and pile up all the HLogs to process in a list in ZK. We also need a way of telling the Master not to delete those logs. We should also handle the fact that an HLog may be moved to the oldlogs directory, so if the HLog isn't in the local log dir, it's probably in the other directory.
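
The "didn't report after X time" check could be sketched as follows. The class name is hypothetical, the timeout is passed in because no default has been decided, and treating a peer that never reported as unreachable is an assumption of this sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical liveness check fed by successful-batch reports from each source.
class PeerLivenessMonitor {
    private final long timeoutMs;
    private final Map<String, Long> lastReportMs = new HashMap<>();

    PeerLivenessMonitor(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called whenever a source successfully ships a batch to the peer.
    void reportSuccess(String peerId, long nowMs) {
        lastReportMs.put(peerId, nowMs);
    }

    // A peer is unreachable if it never reported or its last report is older than the timeout.
    boolean isUnreachable(String peerId, long nowMs) {
        Long last = lastReportMs.get(peerId);
        return last == null || nowMs - last > timeoutMs;
    }
}
```

Time is passed in explicitly so the check can be exercised without sleeping; a real source would feed it the wall clock.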

When the cluster comes back, we process all the HLogs in order without merging them with the current flow of entries, since we would then have two different sets of HLogs to keep track of (we could improve this in the future). It's only when we reach the current HLog file that we flip the switch to take new entries. I expect that to be very tricky.
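
To make the ordering concrete, here is a simplified sketch of draining the backlog oldest-first before switching to the current HLog. CatchUpReplicator is a hypothetical name, and the sketch ignores the hard part (the current log rolling while the backlog is being drained):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: ships queued logs in order, then flips to the live log.
class CatchUpReplicator {
    private final Deque<String> backlog;
    private final String currentLog;

    CatchUpReplicator(List<String> queuedLogsOldestFirst, String currentLog) {
        this.backlog = new ArrayDeque<>(queuedLogsOldestFirst);
        this.currentLog = currentLog;
    }

    // True once every queued log has been handed out for shipping.
    boolean caughtUp() {
        return backlog.isEmpty();
    }

    // Next log to read: the oldest queued log, or the current log once caught up.
    String nextLog() {
        String next = backlog.poll();
        return next != null ? next : currentLog;
    }
}
```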

Even trickier is keeping track of those HLogs when an RS dies on the master cluster. The pile of HLogs to process will still be in ZK along with the latest HLogKey that was processed. It means we have to somehow hand off that processing to one or more RSs. What I'm thinking is that the master, when done splitting logs, should hand that pile to a single RS, which will open a new ReplicationSource and hopefully complete the replication.

We can use the information published in ZK to learn the situation of each replication stream per peer and show that in a UI.


[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833142#action_12833142 ] 

ryan rawson commented on HBASE-2223:
------------------------------------

The problem with automatically spawning hard-hitting jobs is that they usually get spawned at the worst possible time and take down your site, etc.

We probably need some kind of replication status/UI/etc.  Hopefully we can leverage ZK and put most/all of the shared state in there.


[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833187#action_12833187 ] 

Andrew Purtell commented on HBASE-2223:
---------------------------------------

bq. When the slave cluster comes back, initiate a MR job like HBASE-2221

Or just restart replication after the slave is back online and interleave edits from the queue with new ones as necessary?


[jira] Updated: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-2223:
--------------------------------------

      Description: 
We need a nice way of handling long network partitions without impacting a master cluster (which pushes the data). Currently it will just retry over and over again.

I think we could:

 - Stop replication to a slave cluster if it didn't respond for more than 10 minutes
 - Keep track of the duration of the partition
 - When the slave cluster comes back, initiate a MR job like HBASE-2221 

Maybe we want less than 10 minutes, maybe we want this to be all automatic or just the first 2 parts. Discuss.
    Fix Version/s: 0.21.0


[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833192#action_12833192 ] 

Jean-Daniel Cryans commented on HBASE-2223:
-------------------------------------------

bq. Or just restart replication after the slave is back online and interleave edits from the queue with new ones as necessary?

One thing I forgot to add is that the job would be configured to only treat timestamps newer than x.
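
As an illustration only, the "timestamps newer than x" rule amounts to a strict cutoff on each edit's timestamp. The types below are hypothetical and stand in for whatever the MR job would actually read:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal edit: a row key plus the timestamp of the write.
class TimestampedEdit {
    final String row;
    final long timestampMs;

    TimestampedEdit(String row, long timestampMs) {
        this.row = row;
        this.timestampMs = timestampMs;
    }
}

// Hypothetical filter the catch-up job would apply before shipping edits.
class CatchUpFilter {
    // Keeps only edits strictly newer than cutoffMs, preserving input order.
    static List<TimestampedEdit> newerThan(List<TimestampedEdit> edits, long cutoffMs) {
        List<TimestampedEdit> kept = new ArrayList<>();
        for (TimestampedEdit edit : edits) {
            if (edit.timestampMs > cutoffMs) {
                kept.add(edit);
            }
        }
        return kept;
    }
}
```

Here x would be the time of the last edit known to have reached the slave, so anything at or before it is skipped.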

So the problem with resending those edits is something I tackled in HBASE-2197. If one cluster falls very far behind, say 2 hours, we have to decide where we are going to get that data. One option is using the old log files as well as the log files that are currently in the region servers. It ain't so bad, but what happens in the case of failure? In HBASE-2197, the first solution I described involves using a distributed queue where all RSs would participate in processing each log file and interleave them with the rest of the stream.

Another option is keeping yet another set of log files, separate from the "normal" ones, to which we flush log entries if some cluster falls behind. Then if a region server dies, we process both sets of log files.


[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833198#action_12833198 ] 

Andrew Purtell commented on HBASE-2223:
---------------------------------------

{quote}
bq. Or just restart replication after the slave is back online and interleave edits from the queue with new ones as necessary?
[...] this could cause application problems if they need to assume a single arrow of time of edits, and not wanting to see a partial world view
{quote}

Ok, makes sense for the first cut. Especially if replication logic is pluggable and subclassable. Applications can plug in their own policies to do what makes the most sense for them. 


[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833203#action_12833203 ] 

ryan rawson commented on HBASE-2223:
------------------------------------

The previous plan was to NOT delete the log files if replication needed them... Sounds like you changed that?

If replication needs a log file and the server crashed and its logs are set to be split, we need to NOT delete the log file; perhaps move it to a holding area and then get someone to pick it up and send it along.

A 2 hour outage isn't that much; I'd say that we should buffer logs until someone decides it's not worth the disk space. I.e., make it a top-level admin action/alert and give the administrator an option to drop the log retention and then do alternative catch-up later.

It would be better to hold on to a few TB of replication logs and then replay them after 24-48 hours of downtime than to mess with the map-reduce stuff, since you'd have to be careful to avoid duplicating KeyValues.
