Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2008/11/19 19:28:44 UTC

[jira] Created: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

[performance] The replay of logs on server crash takes way too long
-------------------------------------------------------------------

                 Key: HBASE-1008
                 URL: https://issues.apache.org/jira/browse/HBASE-1008
             Project: Hadoop HBase
          Issue Type: Improvement
            Reporter: stack
            Priority: Critical
             Fix For: 0.20.0


Watching recovery from a crash on streamy.com where there were 1048 logs and replay is running at a rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related, so prioritizing it and targeting 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1008:
-------------------------

    Fix Version/s:     (was: 0.20.0)
                   0.19.0

As is, this replay stuff is unacceptable.  Moving into 0.19.0.

Replay is slowing down with time.

Replay is not even multithreaded -- it runs in series.

If the master is shut down during replay, it looks like we lose edits (the region files will not be closed).

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.19.0
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710117#action_12710117 ] 

Jean-Daniel Cryans commented on HBASE-1008:
-------------------------------------------

Stack, we do read it all into memory. I guess we can do what you described. I will open a new JIRA for that.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0, 0.19.3
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch, hbase-1008-v4-0.19.patch, hbase-1008-v4.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708984#action_12708984 ] 

Jean-Daniel Cryans commented on HBASE-1008:
-------------------------------------------

Sounds great. Just to be sure that I understand what you wrote: you basically think that we should reverse the way the latest patch works? Multi-threaded reads and a single writer?

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708760#action_12708760 ] 

stack commented on HBASE-1008:
------------------------------

J-D, what do you think of the suggestion over here: https://issues.apache.org/jira/browse/HBASE-1394?focusedCommentId=12708663&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12708663?

I'm not going to bother with a pool of writers for 0.20.0 -- logging is back to being fast enough, and besides, it looks like we could do with some friction since it's so easy to overrun compactions -- but the bit where we add a timestamp to HLogKey and then run multiple threads in the master splitting up the N logs, how does that sound?  It could cut recovery time by a factor of 3 or 4, or even ten if we ran ten concurrent splitter threads in the master.
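A rough sketch of that second half (the pool size and the splitOneLog helper below are illustrative, not the actual HLog code): the master hands the N logs to a small fixed pool of splitter threads.

{code}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.fs.Path;

public class ConcurrentLogSplitSketch {
  /** Split each crashed-server log on one of nThreads worker threads. */
  public static void splitAll(List<Path> logs, int nThreads) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    for (final Path log : logs) {
      pool.submit(new Runnable() {
        public void run() {
          splitOneLog(log);   // placeholder for the real per-log split routine
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }

  private static void splitOneLog(Path log) {
    // read the log's edits and write them out per region (not shown)
  }
}
{code}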

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709491#action_12709491 ] 

stack commented on HBASE-1008:
------------------------------

J-D, any chance of backporting this to 0.19.3 as well?

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12685373#action_12685373 ] 

Jonathan Gray commented on HBASE-1008:
--------------------------------------

Great work, JD!  I've not tested the patch, but I read through it and it looks good.  One thing though... it might be better to have a default setting for a maximum thread pool size and farm the work out to those threads.  In my case, I had >1000 logs to process, and log reprocessing time is when we least want to run into an OOME.  With that many Java threads, you hit OOME errors from running out of stack or heap, or, even worse, you cause system problems by surpassing the Linux per-user process limit.  In (recent) experience, Java will keep going fine past the soft limits (I had the hard nproc limit way up at 65535), but a bunch of other stuff will stop working (sometimes you can't even ssh in to that machine or user).

There's a nifty Java class for this, ThreadPoolExecutor:  http://java.sun.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.html

Or, more simply, it could be done in batches of 50 or so at a time.
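For illustration, a bounded pool along those lines might look something like this (the thread count and queue depth are hypothetical, not from the patch):

{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedSplitPoolSketch {
  public static ThreadPoolExecutor newSplitPool(int maxThreads) {
    // Fixed-size pool with a bounded work queue; CallerRunsPolicy makes the
    // submitting thread do the work when the queue fills, giving back-pressure
    // instead of spawning one thread per log and risking OOME / nproc limits.
    return new ThreadPoolExecutor(
        maxThreads, maxThreads,
        60L, TimeUnit.SECONDS,
        new ArrayBlockingQueue<Runnable>(1024),
        new ThreadPoolExecutor.CallerRunsPolicy());
  }
}
{code}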

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-1008:
--------------------------------------

    Attachment: hbase-1008-v4.patch
                hbase-1008-v4-0.19.patch

Patches for 0.19 and trunk, with the number of threads as a constant and with more javadoc comments. Would need a +1 from someone who has tested it, please.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0, 0.19.3
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch, hbase-1008-v4-0.19.patch, hbase-1008-v4.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709604#action_12709604 ] 

stack commented on HBASE-1008:
------------------------------

I tested it and it works.

Please fix following when you apply:

There are two lines emitted when HLog is done:

{code}
2009-05-14 21:40:08,467 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: Took 41393ms
2009-05-14 21:40:09,984 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/.logs/aa0-000-15.u.powerset.com_1242336420277_60021
{code}

Can the time taken be added to the "file splitting completed" line?

I think you can name executor threads... that would help with log lines like this:

2009-05-14 21:40:02,309 [pool-1-thread-2] DEBUG org.apache.hadoop.hbase.regionserver.HLog: Thread got 62947 to process
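Something like this would do it (the pool size and the "splitLogWriter" name prefix are just examples):

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class NamedThreadsSketch {
  public static ExecutorService newNamedPool(int nThreads) {
    // Name the worker threads so the log reads "splitLogWriter-2"
    // instead of the default "pool-1-thread-2".
    ThreadFactory named = new ThreadFactory() {
      private final AtomicInteger count = new AtomicInteger(0);
      public Thread newThread(Runnable r) {
        return new Thread(r, "splitLogWriter-" + count.incrementAndGet());
      }
    };
    return Executors.newFixedThreadPool(nThreads, named);
  }
}
{code}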

Who are the edits for?  Add in the region name, I'd say.

Otherwise, looks good.

We still need to rewrite it -- if we crash during this processing we're hosed -- but this is a nice speedup.  I'd say up the default number of threads, J-D, from 3 to 5 or even 10?

Good stuff.

+1 after making the above logging changes.



> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0, 0.19.3
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch, hbase-1008-v4-0.19.patch, hbase-1008-v4.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710074#action_12710074 ] 

stack commented on HBASE-1008:
------------------------------

J-D, is it true that we read in all the logs before we start splitting?  It looks that way after going back to the patch.  If so, I missed that -- my fault -- and I think this is a problem.

Theoretically, we can have at most 64 logs under a regionserver, each of which has ~64MB of edits.  That's 4GB of edits that we need to pull in before we start processing.

Can we not run the writer threads after every Nth file read, say every 5 or even 10?
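Something along these lines is what I'm picturing (editsByRegion, readEditsInto and writeEditsWithThreads below are placeholders for the patch's in-memory map and its read/write phases):

{code}
// Flush the accumulated edits through the writer threads every BATCH logs
// instead of only after every log has been read, bounding memory to roughly
// BATCH logs' worth of edits.
final int BATCH = 5;
for (int i = 0; i < logfiles.length; i++) {
  readEditsInto(editsByRegion, logfiles[i]);          // placeholder: read one log
  if ((i + 1) % BATCH == 0 || i == logfiles.length - 1) {
    writeEditsWithThreads(editsByRegion);             // placeholder: drain to per-region writers
    editsByRegion.clear();
  }
}
{code}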

Thanks.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0, 0.19.3
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch, hbase-1008-v4-0.19.patch, hbase-1008-v4.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689053#action_12689053 ] 

stack commented on HBASE-1008:
------------------------------

J-D, this patch reads all edits into memory.  I suppose that's OK?  IIRC, the log is rotated after N edits rather than after it has grown to a particular size.  If the individual log edits are very large, we could blow out the heap?

Currently the number of threads == the number of regions in a particular commit log?

You might try setting a bigger buffer on the SequenceFile.Reader?  It might make things run faster.
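One way to try that is through the standard Hadoop buffer setting the reader constructor picks up; the 64KB value here is just a guess to experiment with, and fs/logPath are whatever the split code already has in hand:

{code}
// Hadoop's SequenceFile.Reader takes its read buffer from io.file.buffer.size
// (default 4096); bumping it before opening the hlog might speed the read up.
Configuration conf = new Configuration();
conf.setInt("io.file.buffer.size", 64 * 1024);   // 64KB, an arbitrary guess
SequenceFile.Reader in = new SequenceFile.Reader(fs, logPath, conf);
{code}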

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-1008:
--------------------------------------

    Fix Version/s: 0.19.3

No problem.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0, 0.19.3
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704974#action_12704974 ] 

Jean-Daniel Cryans commented on HBASE-1008:
-------------------------------------------

This patch was applied on the openplaces (not openspaces, Stack ;) ) main cluster (which runs on a 0.19 branch) the day I posted the v2 patch. We didn't hit any bugs. I will commit this patch with a bounded number of threads next week when I come back from vacation.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658267#action_12658267 ] 

stack commented on HBASE-1008:
------------------------------

HBASE-1048 should help; a maximum of 64 logs is allowed before a flush of the region with the oldest edit is forced.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.20.0
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649244#action_12649244 ] 

stack commented on HBASE-1008:
------------------------------

Replay of 1084 files took 1 hour and 30 minutes.  During this time, a good part of the cluster was down.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.19.0
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710435#action_12710435 ] 

Jim Kellerman commented on HBASE-1008:
--------------------------------------

I'm not a big fan of having to read all the logs into memory.

My suggestion would be: for each unique region in the HLog(s), create a blocking queue and a thread that dequeues entries and writes them directly to that region's log file. Then you have one thread doing the reading and multiple threads doing the writing, and the memory footprint is reduced significantly.
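Roughly this shape, to make the idea concrete (class and method names are illustrative; the type parameter E stands in for whatever the edit pair is, e.g. an HLogKey plus its value):

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// One of these per unique region found in the logs: the single reader thread
// calls enqueue() for each edit it reads; this thread drains the queue straight
// to the region's split log, so only the currently queued edits sit in memory.
class RegionEditWriterSketch<E> implements Runnable {
  private final BlockingQueue<E> queue = new LinkedBlockingQueue<E>(10000); // bounded -> back-pressure on the reader
  private volatile boolean done = false;

  void enqueue(E edit) throws InterruptedException { queue.put(edit); }
  void finish() { done = true; }

  public void run() {
    try {
      while (!done || !queue.isEmpty()) {
        E edit = queue.poll(100, TimeUnit.MILLISECONDS);
        if (edit != null) {
          append(edit);   // placeholder: append to this region's oldlogfile.log
        }
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }

  private void append(E edit) { /* not shown */ }
}
{code}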

Make sense?

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0, 0.19.3
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch, hbase-1008-v4-0.19.patch, hbase-1008-v4.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-1008.
---------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed in branch and trunk with Stack's suggestions.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0, 0.19.3
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch, hbase-1008-v4-0.19.patch, hbase-1008-v4.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1008:
-------------------------

    Fix Version/s:     (was: 0.19.0)
                   0.20.0

Looking at this, logging needs to be rethought. In the Bigtable paper, the split is distributed. If we're going to have 1000 logs, we need to distribute, or at least multithread, the splitting.

1. As is, regions starting up expect to find only one reconstruction log.  We need to make it so they pick up a bunch of edit logs, and it should be fine that the logs are elsewhere in HDFS, in an output directory written by all split participants, whether the split is multithreaded or a MapReduce-like distributed process (let's write our distributed sort as an MR first so we learn what's involved; the distributed sort should, as much as possible, use MR framework pieces).  On startup, regions go to this directory and pick up the files written by the split participants, deleting and clearing the dir once all have been read in (a rough sketch of the region-startup side follows below).  Making it so a region can take multiple logs as input also makes the split process more robust, rather than the current tenuous process which loses all edits if it doesn't make it to the end without error.
2. Each column family rereads the reconstruction log to find its edits.  We need to fix that: the split can sort the edits by column family so each store only reads its own edits.
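To make point 1 concrete, the region-open side might look roughly like this (the "split.edits" directory name and the replayEdits helper are hypothetical, just to show the shape):

{code}
// On open, replay every edits file the split participants left for this
// region, instead of expecting exactly one reconstruction log.
Path editsDir = new Path(regionDir, "split.edits");   // hypothetical layout
FileStatus[] files = fs.listStatus(editsDir);
if (files != null) {
  for (FileStatus f : files) {
    replayEdits(f.getPath());   // apply one edits file to the region's stores
  }
  fs.delete(editsDir, true);    // clear the dir once everything is applied
}
{code}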

Too much work involved here to make it into 0.19.  Moving it out.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.20.0
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-1008:
--------------------------------------

    Attachment: 1008-v2.patch

Patch I'm currently using.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HBASE-1008:
----------------------------

    Assignee: Jean-Daniel Cryans

Did you say in the meeting that you were going to test this on openspaces, J-D?  Assigning it to you under that assumption.  Assign it to no one if I have it wrong.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649120#action_12649120 ] 

stack commented on HBASE-1008:
------------------------------

It's looking like this replay could take an hour.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.20.0
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708998#action_12708998 ] 

stack commented on HBASE-1008:
------------------------------

I've changed my mind after reading this patch.  This patch looks great, and the amount of splitting processed above -- 3M edits in ~90 seconds -- makes this a good next place to go as regards log recovery.

+1 on commit, but make the upper bound on threads a configuration (it doesn't have to be in hadoop-default.xml -- let fellas read the code to find it).

Meantime, I'll go work elsewhere on bounding the size of the logs so that what shows up in splitLog can be expected to be of reasonable size -- not a size that will blow out memory.
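For the configuration bit, something as simple as this would do (the property name below is made up; whatever key the patch settles on would go in its place):

{code}
int splitWriterThreads = conf.getInt("hbase.regionserver.hlog.splitlog.writer.threads", 3);
ExecutorService writers = Executors.newFixedThreadPool(splitWriterThreads);
{code}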

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1008:
-------------------------

    Priority: Blocker  (was: Critical)

Made it a 0.20.0 blocker since it's a performance issue -- the theme for 0.20.0 -- and because, as is, it's liable to lose data.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.20.0
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "schubert zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689459#action_12689459 ] 

schubert zhang commented on HBASE-1008:
---------------------------------------

I have been using 0.19.1 with this patch for 4 days. It is fine.
I also want to ask questions like Stack's.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683995#action_12683995 ] 

Jean-Daniel Cryans commented on HBASE-1008:
-------------------------------------------

A multi-threaded version that I run at openplaces was able to process 33 logs in "record" time, and our job didn't even fail like it usually does:

{quote}
2009-03-20 15:06:24,047 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 33 log(s) in hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020
2009-03-20 15:06:24,047 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 1 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237560687378
2009-03-20 15:06:24,106 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Adding queue for entities,,1236805004423
2009-03-20 15:06:25,443 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100006 entries
2009-03-20 15:06:25,459 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 2 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237560790320
2009-03-20 15:06:25,879 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Adding queue for hbase_types,426,1225564254435
2009-03-20 15:06:27,101 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100867 entries
2009-03-20 15:06:27,103 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 3 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561649939
2009-03-20 15:06:28,694 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 101754 entries
2009-03-20 15:06:28,696 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 4 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561658514
2009-03-20 15:06:33,324 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 332220 entries
2009-03-20 15:06:33,327 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 5 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561669181
2009-03-20 15:06:38,707 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 349439 entries
2009-03-20 15:06:38,711 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 6 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561688463
2009-03-20 15:06:40,922 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 207909 entries
2009-03-20 15:06:40,925 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 7 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561698495
2009-03-20 15:06:42,048 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 102829 entries
2009-03-20 15:06:42,050 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 8 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561703659
2009-03-20 15:06:44,199 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 204528 entries
2009-03-20 15:06:44,201 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 9 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561715064
2009-03-20 15:06:46,875 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 225964 entries
2009-03-20 15:06:46,878 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 10 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561726289
2009-03-20 15:06:47,885 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100645 entries
2009-03-20 15:06:47,887 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 11 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561793535
2009-03-20 15:06:49,198 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 105605 entries
2009-03-20 15:06:49,222 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 12 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561854543
2009-03-20 15:06:50,227 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100363 entries
2009-03-20 15:06:50,229 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 13 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561941125
2009-03-20 15:06:51,305 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100648 entries
2009-03-20 15:06:51,307 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 14 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561953452
2009-03-20 15:06:53,111 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 159954 entries
2009-03-20 15:06:53,113 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 15 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237561986701
2009-03-20 15:06:54,450 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 101025 entries
2009-03-20 15:06:54,452 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 16 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562003837
2009-03-20 15:06:55,717 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100449 entries
2009-03-20 15:06:55,719 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 17 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562016248
2009-03-20 15:06:56,682 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 101244 entries
2009-03-20 15:06:56,699 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 18 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562049500
2009-03-20 15:06:57,749 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100274 entries
2009-03-20 15:06:57,751 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 19 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562060231
2009-03-20 15:06:59,012 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 111015 entries
2009-03-20 15:06:59,014 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 20 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562127374
2009-03-20 15:06:59,999 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100373 entries
2009-03-20 15:07:00,001 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 21 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562177943
2009-03-20 15:07:01,001 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100213 entries
2009-03-20 15:07:01,003 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 22 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562277537
2009-03-20 15:07:04,116 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 192782 entries
2009-03-20 15:07:04,119 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 23 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562304890
2009-03-20 15:07:05,774 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 101842 entries
2009-03-20 15:07:05,776 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 24 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562315409
2009-03-20 15:07:06,843 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 104371 entries
2009-03-20 15:07:06,845 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 25 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562321140
2009-03-20 15:07:08,213 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 102252 entries
2009-03-20 15:07:08,215 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 26 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562342084
2009-03-20 15:07:09,371 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100330 entries
2009-03-20 15:07:09,373 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 27 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562347414
2009-03-20 15:07:12,583 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 258947 entries
2009-03-20 15:07:12,585 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 28 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562678929
2009-03-20 15:07:13,926 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100082 entries
2009-03-20 15:07:13,928 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 29 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562683228
2009-03-20 15:07:15,216 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100115 entries
2009-03-20 15:07:15,218 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 30 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562708153
2009-03-20 15:07:16,418 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100030 entries
2009-03-20 15:07:16,420 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 31 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562734111
2009-03-20 15:07:17,783 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100054 entries
2009-03-20 15:07:17,785 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 32 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562748277
2009-03-20 15:07:19,902 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Pushed 100116 entries
2009-03-20 15:07:19,904 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 33 of 33: hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020/hlog.dat.1237562763336
2009-03-20 15:07:36,115 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Thread got 49699 to process
2009-03-20 15:07:36,114 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Thread got 4357600 to process
2009-03-20 15:07:36,168 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for path hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/entities/1200514131/oldlogfile.log and region entities,,1236805004423
2009-03-20 15:07:36,199 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for path hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/entities/862114639/oldlogfile.log and region hbase_types,426,1225564254435
2009-03-20 15:07:36,599 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Applied 49699 total edits to hbase_types,426,1225564254435 in 484ms
2009-03-20 15:07:44,734 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server 192.168.1.109:62020, location region name .META.,,1
2009-03-20 15:07:51,633 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Applied 4357600 total edits to entities,,1236805004423 in 15517ms
2009-03-20 15:07:51,633 INFO org.apache.hadoop.hbase.regionserver.HLog: Took 87586ms
2009-03-20 15:07:51,650 INFO org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for hdfs://factory01.lab.mtl:9200/hbase/amsterdam_factory/log_192.168.1.111_1237511553894_62020
{quote}

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.20.0
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-1008:
--------------------------------------

    Attachment: hbase-1008-3.patch

This third version of the patch adds a bounded thread pool.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 1008-v2.patch, hbase-1008-3.patch
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and repay is running at rate of about 20 seconds each.  Meantime these regions are not online.  This is way too long to wait on recovery for a live site.  Marking critical.  Performance related so priority and in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.