Posted to issues@hbase.apache.org by "Jieshan Bean (JIRA)" <ji...@apache.org> on 2011/07/14 13:55:59 UTC

[jira] [Created] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
--------------------------------------------------------------------------------------------

                 Key: HBASE-4095
                 URL: https://issues.apache.org/jira/browse/HBASE-4095
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.90.3
            Reporter: Jieshan Bean


Some very large Hlog files (larger than 10 GB) appeared in our environment, and I found the reason why they grew so huge:

1. The number of replicas is lower than the expected number, so checkLowReplication is called on every sync.

2. checkLowReplication requests a log roll first and sets logRollRequested to true:

{noformat}
private void checkLowReplication() {
  // if the number of replicas in HDFS has fallen below the initial
  // value, then roll logs.
  try {
    int numCurrentReplicas = getLogReplication();
    if (numCurrentReplicas != 0 &&
        numCurrentReplicas < this.initialReplication) {
      LOG.warn("HDFS pipeline error detected. " +
          "Found " + numCurrentReplicas + " replicas but expecting " +
          this.initialReplication + " replicas. " +
          " Requesting close of hlog.");
      requestLogRoll();
      logRollRequested = true;
    }
  } catch (Exception e) {
    LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
        " still proceeding ahead...");
  }
}
{noformat}
3. requestLogRoll() only submits the roll request; the roll may not execute in time, because it must first acquire the (non-fair) cacheFlushLock, and that lock may be held by the cache-flush threads.

4. logRollRequested stays true until the log roll actually executes, so during that time every log-roll request made from sync() is skipped (see the sketch below).
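
For illustration, here is a minimal, simplified sketch of the flag-guard pattern described in steps 2-4 (hypothetical names, not the actual HLog code): the flag is set as soon as a roll is requested, but it is only cleared when rollWriter actually runs, so while the roll is stuck waiting for cacheFlushLock every later request is skipped.

{code}
// Hypothetical, simplified sketch of the pattern described above (not the real HLog code).
class RollFlagSketch {
  private volatile boolean logRollRequested = false;

  // Called on every sync(): once a roll has been requested, later requests are skipped.
  void syncSideCheck(int currentReplicas, int initialReplicas) {
    if (logRollRequested) {
      return;                   // skipped until rollWriter() manages to run
    }
    if (currentReplicas != 0 && currentReplicas < initialReplicas) {
      requestLogRoll();         // only queues the request for the LogRoller thread
      logRollRequested = true;
    }
  }

  // Runs later in the LogRoller thread; it must first acquire the (non-fair)
  // cacheFlushLock, which may be held by cache-flush threads for a long time.
  void rollWriter() {
    // ... close the old writer, open a new hlog ...
    logRollRequested = false;   // the flag is only cleared here
  }

  void requestLogRoll() {
    // signal the LogRoller thread (sketch only)
  }
}
{code}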

Here are the logs from when the problem happened (note the file size of the hlog "193-195-5-111%3A20020.1309937386639" in the last line):

2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069396#comment-13069396 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

Sorry, I misunderstood your comments about the logs:

Yes, I copied the logs together, but pasted them in the wrong order. Sorry.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-V3.patch, HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-V3.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086792#comment-13086792 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

+1 on patch version 9.
Minor comment: the timeout parameter should be a long:
{code}
+  void batchWriteAndWait(HTable table, int start, boolean expect, int timeout)
{code}
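
For clarity, the suggested change is only to the parameter type, roughly (a sketch of the revised signature, not copied from the patch):
{code}
// as suggested above: declare the timeout as long instead of int
void batchWriteAndWait(HTable table, int start, boolean expect, long timeout)
{code}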

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HBase-4095-V9-branch.patch, HBase-4095-V9-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V5-Branch.patch

@Ted,
{noformat}
But the rollWriter was able to executed in time. So the following roll-requests were skippted.

I think you meant:
But the rollWriter wasn't able to execute in time. And the following roll-requests were skipped.
{noformat}
Yes, you're right. Sorry:(

@ramkrishna,
There are several possible reasons "why did rollWriter not execute in time". It may fail to get the cacheFlushLock; the patch takes care of that case.
If one rollWriter request is blocked (not allowed to hold cacheFlushLock), it will not affect the following requests.
There are three scenarios that request rollWriter:
1. Periodic execution.
2. The current Hlog file has grown larger than 256 MB (a configurable parameter).
3. The current Hlog replica count is lower than the initial replica count.

The "disable the rolling-request" part of the patch only disables requests from scenario 3.
We needn't expose the new properties in hbase-default.xml, since they have default values.
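
As an illustration of that approach (hypothetical names and configuration handling, not the actual patch), gating only the scenario-3 requests might look roughly like this:

{code}
// Hypothetical sketch: only the low-replication (scenario 3) roll requests are gated;
// periodic and size-based rolls are untouched. Not the actual patch.
class LowReplicationRollSketch {
  // assumed knob with a built-in default, so it need not appear in hbase-default.xml
  private final int lowReplicationRollLimit;
  private int consecutiveLowReplicationRequests = 0;
  private boolean lowReplicationRollEnabled = true;

  LowReplicationRollSketch(int lowReplicationRollLimit) {
    this.lowReplicationRollLimit = lowReplicationRollLimit;
  }

  // Scenario 3: the replica count dropped below the initial replication.
  void onLowReplicationDetected() {
    if (!lowReplicationRollEnabled) {
      return;                                // scenario-3 requests are disabled...
    }
    if (consecutiveLowReplicationRequests >= lowReplicationRollLimit) {
      lowReplicationRollEnabled = false;     // ...after too many consecutive attempts
      return;
    }
    consecutiveLowReplicationRequests++;
    requestLogRoll();
  }

  // Once the replica count recovers, scenario-3 rolling is allowed again.
  void onReplicationRecovered() {
    consecutiveLowReplicationRequests = 0;
    lowReplicationRollEnabled = true;
  }

  // Scenarios 1 and 2 (periodic roll, size-based roll) call this directly, as before.
  void requestLogRoll() {
    // signal the LogRoller thread (sketch only)
  }
}
{code}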


> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HlogFileIsVeryLarge.gif, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBase-4095-V7-branch.patch)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069403#comment-13069403 ] 

stack commented on HBASE-4095:
------------------------------

You say "But the rollWriter may be blocked this time". Why is it blocked? And does it just stay blocked forever? (Thanks for the help making me understand, Jieshan.)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-V3.patch, HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-V3.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Reopened] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu reopened HBASE-4095:
---------------------------


TestLogRolling failed in build 2040.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069393#comment-13069393 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

About the report: I mixed up several test runs.
The first time, I ran all the tests, but several failed (due to hostname problems). So I changed some code and ran the failed tests again; all of the tests passed.

Before the patch, this.logRollRequested = true was set inside checkLowReplication(). But if rollWriter is blocked, all of the following rolls are skipped. That's why I made the modification and moved the assignment of this.logRollRequested = true.

The name of this.logRollRequested may seem misleading, but if we don't move the assignment, rollWriter may not run for a long time and the HLog can become very large.
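
A simplified before/after sketch of the kind of change described here (hypothetical placement, not the actual patch):

{code}
// Hypothetical before/after of moving the this.logRollRequested assignment
// out of checkLowReplication(); illustrative only, not the actual patch.
class FlagPlacementSketch {
  private volatile boolean logRollRequested = false;

  private boolean replicasBelowInitial() { return false; /* stub for the sketch */ }
  private void requestLogRoll() { /* signal the LogRoller thread (stub) */ }

  // BEFORE: the flag latches as soon as a roll is requested. If rollWriter() then
  // blocks on cacheFlushLock, every following request is skipped while it waits.
  void checkLowReplicationBefore() {
    if (replicasBelowInitial() && !logRollRequested) {
      requestLogRoll();
      logRollRequested = true;
    }
  }

  // AFTER (one possible placement): requesting a roll no longer latches the flag,
  // so a blocked roll does not suppress the requests that follow.
  void checkLowReplicationAfter() {
    if (replicasBelowInitial()) {
      requestLogRoll();
    }
  }

  // The flag is instead managed where the roll actually executes.
  void rollWriterAfter() {
    logRollRequested = true;   // set only when the roll really runs
    // ... close the old writer, open a new hlog ...
    logRollRequested = false;
  }
}
{code}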

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-V3.patch, HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-V3.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081438#comment-13081438 ] 

ramkrishna.s.vasudevan commented on HBASE-4095:
-----------------------------------------------

Patch looks good. 



> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082810#comment-13082810 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

I will do it immediately. Thanks, stack :)
(LLT means unit tests based on the existing test code. I'll try to write one test case.)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu reassigned HBASE-4095:
-----------------------------

    Assignee: Jieshan Bean


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067508#comment-13067508 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

I ran TestLogRolling manually twice and TestDistributedLogSplitting once.
They passed.

@Jieshan:
Running related unit tests before submission is always a good idea.


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087254#comment-13087254 ] 

stack commented on HBASE-4095:
------------------------------

Thank you for your persistence, Jieshan, in trying to get this patch in. I'm basically +1 on v9. Below is a small question.

{code}
-        this.logRollRequested = false;
+        this.logRollRunning = false;
{code}

Should this.logRollRunning instead be cleared in a finally block here in rollWriter? If an exception is thrown after we set logRollRunning, it looks like the flag could stay set. My guess is that not doing this would probably go unnoticed, since we probably crash out the regionserver if a rollWriter fails, but having the flag stuck set might make for some unexpected state?
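Roughly this shape, for illustration (a standalone sketch with simplified names, not the actual HLog code):

{code}
// Standalone sketch: clear the flag in a finally so it cannot stay stuck if the roll throws.
class RollFlagSketch {
  private volatile boolean logRollRunning = false;

  void rollWriter() throws java.io.IOException {
    logRollRunning = true;
    try {
      doRoll();                  // close the old writer, open a new one; may throw
    } finally {
      logRollRunning = false;    // cleared even on failure
    }
  }

  private void doRoll() throws java.io.IOException { /* ... */ }
}
{code}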

So, it looks like we'll roll 5 times by default before we'll turn off the low replication log rolling facility -- which is better than a log per sync, right?

You seem to lose some 'liveness' as regards the file replication setting. In the code before this patch, when rollWriter ran, it would ask the FS what the replication on the new writer is, and that would be the replication to use going forward. See here:

{code}
-      int nextInitialReplication = fs.getFileStatus(newPath).getReplication();
{code}

Instead you set the replication once on instantiation of HLog.  See here:

{code}
+    this.minTolerableReplication = conf.getInt(
+        "hbase.regionserver.hlog.tolerable.lowreplication",
+        this.fs.getDefaultReplication());
{code}

.. and rather than ask for the file's replication, you use the filesystem default value.

Do you have a good reason for changing this behavior?

Thanks Jieshan.







[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069390#comment-13069390 ] 

stack commented on HBASE-4095:
------------------------------

Jieshan.  Is there something messed up in the logs you report way up at the top of this issue?  Time starts to go backwards if I look at the log timestamps.  We go:

2011-07-06 15:29:57,478
2011-07-06 15:28:59,236
2011-07-06 15:29:46,714

Your patch seems a little odd in that we do this.logRollRequested = true; inside a method named rollWriter. That doesn't seem right; i.e., at the start of log rolling, we set a flag saying we have requested a log roll.

Is the issue that this.logRollRequested is being set to true but it's not being cleared appropriately? Once set, we'll never roll the logs again even though the replica count is < 3?
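
In other words, the flag reads like a simple request latch, roughly (a standalone sketch with simplified names, not the real code):

{code}
// Standalone sketch of the request-latch semantics (illustrative only):
class RollRequestLatchSketch {
  private volatile boolean logRollRequested = false;

  // called on every sync()
  void checkLowReplication(int currentReplicas, int initialReplication) {
    if (!logRollRequested && currentReplicas != 0
        && currentReplicas < initialReplication) {
      requestLogRoll();          // queue the request for the LogRoller thread
      logRollRequested = true;   // set when the roll is requested ...
    }
  }

  // called once the LogRoller thread has actually rolled the log
  void onRollCompleted() {
    logRollRequested = false;    // ... and cleared only when the roll has happened
  }

  private void requestLogRoll() { /* wake the LogRoller */ }
}
{code}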




[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBase-4095-V9-trunk.patch)


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068183#comment-13068183 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

I wonder if there is another reason.
I added an else block for the cacheFlushLock.tryLock() call and placed a breakpoint inside the else block.

testLogRollOnDatanodeDeath failed but the breakpoint wasn't hit.
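
Roughly the shape of that instrumentation (a standalone sketch; the lock type and method names here are assumptions, not the actual HLog/test code):

{code}
import java.util.concurrent.locks.ReentrantLock;

// Standalone sketch of the tryLock probe described above (hypothetical).
class TryLockProbe {
  private final ReentrantLock cacheFlushLock = new ReentrantLock();

  void rollWriterIfPossible() {
    if (cacheFlushLock.tryLock()) {
      try {
        // ... roll the writer ...
      } finally {
        cacheFlushLock.unlock();
      }
    } else {
      // breakpoint placed here; it was never hit while testLogRollOnDatanodeDeath failed
    }
  }
}
{code}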


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V9-branch.patch
                HBase-4095-V9-trunk.patch

I increased the timeout value to 1000. It's enough. Thanks, Ted.


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: surefire-report-branch.html
                HBASE-4095-trunk-V3.patch
                HBASE-4095-90-V3.patch

I misunderstood the method HLog#getLogReplication: it also just returns the default replica count of a newly created block (not that of the old one).
I just changed the place where "this.logRollRequested = true" is set.

I have run tests with the branch patch; for the trunk one, I didn't set up the environment.

Though the patch is only a small modification, the problem of the large HLog is well resolved by it.


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V4-Branch.patch
                TestResultForPatch-V4.rar

The analysis above explains why the large HLog file occurred.
In brief, it is because of the variable "HLog#logRollRequested". Once a rollWriter was requested, its value was set to true. But the rollWriter might not be executed in time, so the subsequent roll requests were skipped.

In the original code, the following was used to obtain the initial HLog replica count:
{noformat}
 int nextInitialReplication = fs.getFileStatus(newPath).getReplication(); 
{noformat}
But its value always equals the default replication number. Once the actual datanode count is less than the default replication value, rollWriter will be requested on every sync,
so rollWriter was executed again and again at high frequency.

The V4 patch tries to solve both the "large-file" problem and the "high-frequency roll request" problem:

1. Replace the variable "HLog#logRollRequested" with "HLog#logRollRunning". A new request is skipped while another rollWriter is running.
2. Add a new variable named "minTolerableReplication". Once the actual replica count is less than it, send a rollWriter request.
3. Once a sequence of fruitless rollWriter calls has executed, disable roll requests from checkLowReplication(). Re-enable them when the replica count returns to normal. (A rough sketch of this flow is shown below.)

I have run many tests on the patch. Please check the results in the attachment "TestResultForPatch-V4.rar".
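
Roughly, the intended flow is the following (a simplified, self-contained sketch; the names mirror the description above, while the structure, numbers and helper methods are illustrative, not the actual patch code):

{code}
// Simplified sketch of the V4 flow; numbers and helper methods are illustrative only.
class LowReplicationRollSketch {
  private volatile boolean logRollRunning = false;
  private volatile boolean lowReplicationRollEnabled = true;
  private final int minTolerableReplication = 3;  // e.g. hbase.regionserver.hlog.tolerable.lowreplication
  private final int lowReplicationRollLimit = 5;  // stop requesting after this many low-replication rolls
  private int consecutiveLogRolls = 0;

  // Called from sync(): request a roll when replication drops below the tolerable minimum.
  void checkLowReplication(int currentReplicas) {
    if (lowReplicationRollEnabled
        && currentReplicas < minTolerableReplication
        && !logRollRunning) {
      requestLogRoll();  // only queues the request; the LogRoller thread does the work
    }
  }

  // Called by the LogRoller thread.
  void rollWriter(int replicasOfNewLog) {
    logRollRunning = true;
    try {
      // ... close the old writer, open a new one ...
      if (replicasOfNewLog < minTolerableReplication) {
        // rolling did not restore replication
        if (++consecutiveLogRolls >= lowReplicationRollLimit) {
          lowReplicationRollEnabled = false;  // give up on low-replication-triggered rolls for now
        }
      } else {
        consecutiveLogRolls = 0;
        lowReplicationRollEnabled = true;     // replication recovered, re-enable
      }
    } finally {
      logRollRunning = false;  // never leave the flag stuck, even if the roll fails
    }
  }

  private void requestLogRoll() { /* wake the LogRoller thread */ }
}
{code}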


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: LatestLLTResults-20110810.rar

I repeated the LLT tests several times. All the tests passed except one, TestHFileReaderV1, and I found it was already failing before this patch (I have no time to dig in, but the failure has no relationship with this patch).
Please find the results in the attached file "LatestLLTResults-20110810.rar".


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBASE-4095-trunk.patch
                HBASE-4095-90.patch

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>         Attachments: HBASE-4095-90.patch, HBASE-4095-trunk.patch
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: hbase-root-regionserver-193-195-5-111.rar


I attached the full regionserver logs as "hbase-root-regionserver-193-195-5-111.rar":
1. One datanode went down, which triggered the rollWriter.
2. Before the patch, checkLowReplication looked like this:
  private void checkLowReplication() {
    // if the number of replicas in HDFS has fallen below the initial
    // value, then roll logs.
    try {
      int numCurrentReplicas = getLogReplication();
      if (numCurrentReplicas != 0 &&
          numCurrentReplicas < this.initialReplication) {
        LOG.warn("HDFS pipeline error detected. " +
            "Found " + numCurrentReplicas + " replicas but expecting " +
            this.initialReplication + " replicas. " +
            " Requesting close of hlog.");
        requestLogRoll();
        logRollRequested = true;
      }
    } catch (Exception e) {
      LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
          " still proceeding ahead...");
    }
  }
 
  It requested a rollWriter and set logRollRequested to true.
  But the rollWriter may be blocked at this point, so logRollRequested stays true until the next rollWriter finishes successfully.
  During that time, clients kept writing data into this RegionServer. Once the Hlog grew larger than 256M (assuming that is the roll size), it should have triggered a new rollWriter,
  but every such request was skipped, because logRollRequested was still true in sync():
	public void sync() throws IOException {
	
	   -----
       if (!logRollRequested) {
          checkLowReplication();
          if (this.writer.getLength() > this.logrollsize) {
            requestLogRoll();
          }
       }
    }
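
To make that skip concrete, here is a tiny single-threaded sketch of the gating (illustration only; the names merely mirror the HLog snippets above, none of this is HBase code):
{code}
/**
 * Illustration of the skipped roll requests: while logRollRequested stays true,
 * sync() never issues another roll request, so the hlog can grow far past
 * logrollsize until the blocked roll finally executes.
 */
public class SkippedRollIllustration {
  static boolean logRollRequested = false;
  static long writerLength = 0;                 // stands in for this.writer.getLength()
  static final long logrollsize = 256L << 20;   // 256MB roll threshold
  static int rollRequests = 0;

  static void requestLogRoll() { rollRequests++; }

  // mirrors the tail of the sync() snippet above
  static void sync(long bytesAppended) {
    writerLength += bytesAppended;
    if (!logRollRequested) {
      if (writerLength > logrollsize) {
        requestLogRoll();
        logRollRequested = true;
      }
    }
  }

  public static void main(String[] args) {
    // checkLowReplication() already asked for a roll, but that roll is stuck
    // behind cacheFlushLock, so the flag never gets cleared here...
    logRollRequested = true;
    for (int i = 0; i < 20000; i++) {
      sync(1 << 20);                            // ...while clients keep appending ~1MB per sync
    }
    System.out.println("hlog grew to ~" + (writerLength >> 30)
        + "GB with " + rollRequests + " additional roll requests");
  }
}
{code}
Run as-is it reports a roughly 19GB hlog and zero additional roll requests, which is about the size of the 19220372830-byte file seen in the logs above.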

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-V3.patch, HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-V3.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-branch.html
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067571#comment-13067571 ] 

Hudson commented on HBASE-4095:
-------------------------------

Integrated in HBase-TRUNK #2039 (See [https://builds.apache.org/job/HBase-TRUNK/2039/])
    HBASE-4095  Hlog may not be rolled in a long time if checkLowReplication's
               request of LogRoll is blocked (Jieshan via Ted Yu)

tedyu : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java


> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067509#comment-13067509 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

Sorry, about the trunk patch: I didn't have an environment for it (though I have run plenty of tests against the branch). I'll remember to run the unit tests for trunk next time.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBASE-4095-trunk-V3.patch)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-branch.html
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067514#comment-13067514 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

@Jieshan:
I use the following command to run the test suite on Linux:
{code}
nohup mvn package > 4095.results
{code}

I normally wait for TRUNK build to pass before resolving an issue.
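
For what it's worth, a narrower run that exercises only the WAL test stack mentioned is also possible while iterating (assuming the standard Surefire setup of the HBase build):
{code}
mvn test -Dtest=TestHLog
{code}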

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081027#comment-13081027 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

Ran through unit tests.
+1 on patch.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HlogFileIsVeryLarge.gif, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069404#comment-13069404 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

I don't know why it stayed blocked for so long. The rollWriter must acquire this.cacheFlushLock, so if the RS is flushing at that moment, the rollWriter cannot execute in time (the lock is held by the cache-flush threads).

I think using tryLock is better: if rollWriter cannot get the lock within a few seconds, we just skip the roll this time (see the sketch after the snippet below).


  public long startCacheFlush() {
    this.cacheFlushLock.lock();
    return obtainSeqNum();
  } 
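
A minimal sketch of that tryLock idea (an assumption about the direction only, not the attached patch; the field names merely mirror HLog's):
{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: a roll attempt that gives up if a flush holds the lock too long,
// leaving the roll pending instead of blocking indefinitely.
class TryLockRollSketch {
  private final ReentrantLock cacheFlushLock = new ReentrantLock();

  /** @return true if the roll ran, false if the lock could not be obtained in time. */
  boolean tryRollWriter() throws InterruptedException {
    if (!cacheFlushLock.tryLock(3, TimeUnit.SECONDS)) {   // hypothetical 3s bound
      return false;                                       // skip this attempt; try again later
    }
    try {
      // ... close the current hlog and open a new one here ...
      return true;
    } finally {
      cacheFlushLock.unlock();
    }
  }
}
{code}
Whether a skipped attempt is retried by the LogRoller or simply picked up by the next sync() is a design point the real patch would have to settle.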

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-V3.patch, HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-V3.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-branch.html
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V7-trunk.patch
                HBase-4095-V7-branch.patch

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082644#comment-13082644 ] 

stack commented on HBASE-4095:
------------------------------

bq.  Though I've taken many times of unit tests on it, I'm running more

When I say unit test, I'm talking of adding some code to TestHLog under src/test that "proves" your patch fixes this issue.  Would that be possible?

(What's an LLT test?)
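
For the record, the property such a TestHLog case would assert could look roughly like the sketch below (illustration only; a real test would drive an actual HLog via HBaseTestingUtility, which this self-contained sketch deliberately does not attempt):
{code}
import static org.junit.Assert.assertTrue;

import org.junit.Test;

/** Shape of the assertion only; FakeWal is a hypothetical stand-in, not HLog. */
public class RollNotStarvedSketchTest {

  /** Minimal stand-in for the desired post-fix bookkeeping. */
  static class FakeWal {
    long length = 0;
    final long logrollsize = 256L << 20;
    int rollRequests = 0;

    void append(long bytes) {
      length += bytes;
      // desired behaviour: a pending (possibly blocked) low-replication roll
      // must not suppress size-based roll requests forever
      if (length > logrollsize) {
        rollRequests++;
      }
    }
  }

  @Test
  public void sizeBasedRollStillRequestedWhileEarlierRollIsBlocked() {
    FakeWal wal = new FakeWal();
    for (int i = 0; i < 1000; i++) {
      wal.append(1 << 20);      // 1MB per append, no roll ever completes
    }
    assertTrue("expected at least one size-based roll request", wal.rollRequests > 0);
  }
}
{code}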

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4095:
--------------------------

    Summary: Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked  (was: Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068407#comment-13068407 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

Rolled back in branch and TRUNK.

Waiting for a better solution.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067511#comment-13067511 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

So, can I resolve this issue?

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4095:
-------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to branch and trunk. Thank you for your perseverance, Jieshan.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HBase-4095-V9-branch.patch, HBase-4095-V9-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4095:
-------------------------

    Fix Version/s:     (was: 0.90.4)
                   0.90.5

Moving out of 0.90.4

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069397#comment-13069397 ] 

stack commented on HBASE-4095:
------------------------------

"But if rollWriter is blocked, all the following roll will be skippted."

When is the above so?
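
For readers following the thread, here is a minimal sketch of the scenario being discussed, assuming only that the roller must take the same non-fair lock that a long-running cache flush already holds; the class, field, and method names below are illustrative and are not taken from the HBase source:

{noformat}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

/** Illustration only: a blocked roll plus a sticky flag causes later rolls to be skipped. */
public class BlockedRollDemo {
  // Non-fair lock standing in for cacheFlushLock; names are illustrative, not HBase's.
  private static final ReentrantLock cacheFlushLock = new ReentrantLock();
  private static final AtomicBoolean logRollRequested = new AtomicBoolean(false);

  public static void main(String[] args) throws InterruptedException {
    // A "flush" thread grabs the lock and holds it for a long time.
    Thread flusher = new Thread(() -> {
      cacheFlushLock.lock();
      try {
        TimeUnit.SECONDS.sleep(10);           // long-running cache flush
      } catch (InterruptedException ignored) {
      } finally {
        cacheFlushLock.unlock();
      }
    });
    flusher.start();
    TimeUnit.MILLISECONDS.sleep(100);          // let the flusher win the lock

    // The "roller" thread blocks here until the flush finishes.
    Thread roller = new Thread(() -> {
      cacheFlushLock.lock();
      try {
        System.out.println("roll finally executed");
        logRollRequested.set(false);           // flag is only cleared once the roll runs
      } finally {
        cacheFlushLock.unlock();
      }
    });

    // Meanwhile, every "sync" sees the flag already set and skips the request.
    logRollRequested.set(true);                // first low-replication check requests a roll
    roller.start();
    for (int i = 0; i < 5; i++) {
      if (logRollRequested.get()) {
        System.out.println("roll already requested; skipping"); // what the issue describes
      } else {
        System.out.println("would request another roll");
      }
      TimeUnit.SECONDS.sleep(1);
    }
    roller.join();
    flusher.join();
  }
}
{noformat}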

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-V3.patch, HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-V3.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, surefire-report-branch.html

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068175#comment-13068175 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

 
assertTrue("Missing datanode should've triggered a log roll",
        newFilenum > oldFilenum && newFilenum > curTime);

I've got the reasons why the TestLogRolling failed intermittently.

1. Before the patch of this issue, if one datanode was killed, it will trigger a log-roll. For it will get the lock of cacheFlushLock finally.

2. But after the modification, if the log-roll didn't get the lock in 5 second, it will skip the operations.
So it may not get the lock of cacheFlushLock in 5 second , so it failed.
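
A rough sketch of the timing described above, assuming the patch bounds the lock acquisition with something like a 5-second tryLock (the real patch may differ in detail; the class and method names here are illustrative):

{noformat}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of a bounded roll attempt; not the actual patch. */
public class BoundedRollAttempt {
  private final ReentrantLock cacheFlushLock = new ReentrantLock();

  /** Returns true if the roll ran, false if it gave up after the timeout. */
  public boolean tryRollWriter() throws InterruptedException {
    // If a cache flush holds the lock for more than ~5 seconds, the roll is skipped,
    // which is why the TestLogRolling assertion can fail intermittently.
    if (!cacheFlushLock.tryLock(5, TimeUnit.SECONDS)) {
      return false;
    }
    try {
      // ... close the current hlog and open a new one ...
      return true;
    } finally {
      cacheFlushLock.unlock();
    }
  }
}
{noformat}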

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4095:
--------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067506#comment-13067506 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

About the failure in trunk (sorry, I only tested the branch patch): is it a problem?
Thanks, Ted.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087560#comment-13087560 ] 

Hudson commented on HBASE-4095:
-------------------------------

Integrated in HBase-TRUNK #2125 (See [https://builds.apache.org/job/HBase-TRUNK/2125/])
    HBASE-4095 Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

stack : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java


> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HBase-4095-V9-branch.patch, HBase-4095-V9-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890
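
The four steps above, together with the log excerpt (the final roll shows filesize=19220372830, roughly 18 GB), describe a flag-plus-lock starvation: the roll request is queued once, the flag that would allow further requests is never cleared until the roll runs, and the roll cannot run while a flush holds cacheFlushLock. Below is a minimal, self-contained sketch of that pattern in plain Java; it is not HBase code, and every class, field, and method name in it is invented for illustration.

{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Minimal sketch (NOT HBase code) of the starvation described in steps 1-4:
 * once logRollRequested is set, later sync() calls stop asking for a roll,
 * while the roll itself waits on a lock held by the cache-flush side.
 */
public class LogRollStarvationSketch {
  // Stands in for HLog's non-fair cacheFlushLock.
  private final ReentrantLock cacheFlushLock = new ReentrantLock(false);
  private volatile boolean logRollRequested = false;
  private volatile boolean rollPending = false;

  // Steps 1-2: a sync() that sees low replication asks for a roll once,
  // then keeps skipping until the flag is cleared.
  void checkLowReplication() {
    if (!logRollRequested) {
      System.out.println("sync(): low replication, requesting roll");
      rollPending = true;        // step 3: only queues the request
      logRollRequested = true;   // step 2: flag set, cleared only by the roll
    } else {
      System.out.println("sync(): roll already requested, skipping (step 4)");
    }
  }

  // The roller: real HBase blocks on cacheFlushLock; here it just gives up
  // when the lock is busy so this single-threaded demo keeps moving.
  void rollWriterIfPossible() {
    if (!rollPending || !cacheFlushLock.tryLock()) {
      return;
    }
    try {
      System.out.println("roller: rolling the log now");
      rollPending = false;
      logRollRequested = false;  // only here does the flag go back to false
    } finally {
      cacheFlushLock.unlock();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    LogRollStarvationSketch wal = new LogRollStarvationSketch();

    // A long memstore flush holds cacheFlushLock, like the ~10.8 s flush in the logs.
    Thread flusher = new Thread(() -> {
      wal.cacheFlushLock.lock();
      try {
        TimeUnit.SECONDS.sleep(2);
      } catch (InterruptedException ignored) {
        Thread.currentThread().interrupt();
      } finally {
        wal.cacheFlushLock.unlock();
      }
    });
    flusher.start();
    TimeUnit.MILLISECONDS.sleep(100);  // let the flusher grab the lock first

    // While the flush holds the lock, every further sync() is skipped and the
    // current hlog just keeps growing (the 19220372830-byte file above).
    for (int i = 0; i < 5; i++) {
      wal.checkLowReplication();
      wal.rollWriterIfPossible();
      TimeUnit.MILLISECONDS.sleep(300);
    }
    flusher.join();
    wal.rollWriterIfPossible();        // succeeds once the flush releases the lock
  }
}
{code}

Running it prints one roll request, a string of skipped syncs while the simulated flush holds the lock, and a roll only after the flush finishes, which mirrors the hour-long gap between the 15:29 roll and the 16:30 roll in the logs above.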

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V7-trunk.patch
                HBase-4095-V7-branch.patch

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBase-4095-V9-branch.patch)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HBase-4095-V9-branch.patch, HBase-4095-V9-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V9-trunk.patch
                HBase-4095-V9-branch.patch

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HBase-4095-V9-branch.patch, HBase-4095-V9-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068747#comment-13068747 ] 

gaojinchao commented on HBASE-4095:
-----------------------------------

I added some logging and found that initialReplication is zero.
When we create a file in HDFS and have not written any data to it yet, the reported replication is zero.
So the proposed solution has an issue (see the sketch after the log excerpt below).

2011-07-20 19:38:20,517 WARN  [RegionServer:1;C4C3.site,41763,1311161899551] wal.HLog(478): gjc:rollWriter start1311161900517
2011-07-20 19:38:20,650 WARN  [RegionServer:0;C4C3.site,35697,1311161899494] wal.HLog(478): gjc:rollWriter start1311161900650
2011-07-20 19:38:20,707 WARN  [RegionServer:1;C4C3.site,41763,1311161899551] wal.HLog(518): gjc:updateLock start1311161900707
2011-07-20 19:38:20,707 WARN  [RegionServer:1;C4C3.site,41763,1311161899551] wal.HLog(532): gjc:initialReplication start0
2011-07-20 19:38:21,238 WARN  [RegionServer:0;C4C3.site,35697,1311161899494] wal.HLog(518): gjc:updateLock start1311161901238
2011-07-20 19:38:21,239 WARN  [RegionServer:0;C4C3.site,35697,1311161899494] wal.HLog(532): gjc:initialReplication start0
2011-07-20 19:38:41,726 WARN  [IPC Server handler 4 on 37616] wal.HLog(478): gjc:rollWriter start1311161921726
2011-07-20 19:38:41,769 WARN  [IPC Server handler 4 on 37616] wal.HLog(518): gjc:updateLock start1311161921769
2011-07-20 19:38:41,769 WARN  [IPC Server handler 4 on 37616] wal.HLog(532): gjc:initialReplication start0
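
The point above is that initialReplication is captured right after rollWriter creates the new file, before any edit has been written, so it can legitimately be 0; any check that compares against it has to treat 0 as "baseline not sampled yet" rather than as a real replica count. A minimal sketch of that idea in plain Java follows; it is not taken from any of the attached patches, and all names in it are invented for illustration.

{code}
/**
 * Illustrative sketch only (NOT the attached patch): avoid taking the
 * 0-replica reading of a freshly created, still-empty HDFS file as the
 * baseline that the low-replication check compares against.
 */
public class ReplicationBaselineSketch {
  private int initialReplication = 0;   // 0 means "baseline not sampled yet"

  // Feed this the value that getNumCurrentReplicas() would report after a sync.
  void onSync(int numCurrentReplicas) {
    if (initialReplication == 0) {
      if (numCurrentReplicas > 0) {
        // First non-zero reading becomes the baseline; a brand-new file that
        // has not been written to yet reports 0 and must not be trusted.
        initialReplication = numCurrentReplicas;
        System.out.println("baseline captured: " + initialReplication);
      }
      return;
    }
    if (numCurrentReplicas != 0 && numCurrentReplicas < initialReplication) {
      System.out.println("pipeline degraded: " + numCurrentReplicas
          + " < " + initialReplication + ", a log roll should be requested");
    }
  }

  public static void main(String[] args) {
    ReplicationBaselineSketch wal = new ReplicationBaselineSketch();
    wal.onSync(0);   // fresh file, nothing written yet -> ignored, not a baseline
    wal.onSync(3);   // first real reading becomes the baseline
    wal.onSync(2);   // now a genuine drop is detected
  }
}
{code}

With a guard like this, the first non-zero reading after data reaches the pipeline becomes the baseline, and a later drop (3 to 2 in the example) is still detected.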


> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081447#comment-13081447 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

Yes, this problem happens in a special scenario. Although I have already run the unit tests many times, I am running more; the results will be attached later.
I also tested the patch for several days in our real environment (during that time I killed some datanodes, restarted them, and repeated those steps again and again), paying special attention to the HLog paths. It is indeed a good solution to this problem.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBASE-4095-trunk-v2.patch
                HBASE-4095-90-v2.patch

Thanks, Ted.
I have updated the patches according to your suggestions.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067499#comment-13067499 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

@Jieshan:
I saw the following test failures with the TRUNK patch applied:
{code}
Running org.apache.hadoop.hbase.regionserver.wal.TestLogRolling
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 108.675 sec <<< FAILURE!
--
Running org.apache.hadoop.hbase.master.TestDistributedLogSplitting
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 162.863 sec <<< FAILURE!
{code}
Running the tests manually, they passed.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBase-4095-V9-branch.patch)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081423#comment-13081423 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

Review Board related address: https://reviews.apache.org/r/1354/

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086890#comment-13086890 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

Sorry, it's 10000 not 1000.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HBase-4095-V9-branch.patch, HBase-4095-V9-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: flowChart-IntroductionToThePatch.gif
                HBase-4095-V9-trunk.patch
                HBase-4095-V9-branch.patch

Thanks, Ted. I've changed them.


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBASE-4095-90-V3.patch)


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068191#comment-13068191 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

Here are the logs from a failure test:
{noformat}
---------------Rolling from the first test-------------------------------
2011-07-20 12:35:59,121 INFO  [main] wal.HLog(529): Roll /user/root/.logs/C4C2.site,42700,1311136499029/C4C2.site%3A42700.1311136501182, entries=296, filesize=307097. New hlog /user/root/.logs/C4C2.site,42700,1311136499029/C4C2.site%3A42700.1311136559078
2011-07-20 12:35:59,121 DEBUG [main] wal.HLog(541): Last sequenceid written is empty. Deleting all old hlogs
2011-07-20 12:35:59,136 INFO  [main] wal.TestLogRolling(206): after flushing all regions and rolling logs there are 0 log files

---------------Rolling from the second test (testLogRollOnDatanodeDeath)----------------
2011-07-20 12:35:59,141 INFO  [main] wal.TestLogRolling(264): Replication=2
2011-07-20 12:37:02,434 WARN  [RegionServer:1;C4C2.site,42700,1311136499029.logSyncer] wal.HLog(1035): HDFS pipeline error detected. Found 1 replicas but expecting 2 replicas.  Requesting close of hlog.
2011-07-20 12:37:02,497 INFO  [RegionServer:1;C4C2.site,42700,1311136499029.logRoller] wal.HLog(529): Roll /user/root/.logs/C4C2.site,42700,1311136499029/C4C2.site%3A42700.1311136559078, entries=3, filesize=1043. New hlog /user/root/.logs/C4C2.site,42700,1311136499029/C4C2.site%3A42700.1311136622435
{noformat}

From the logs, we can see that the Hlog was indeed rolled.
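As a concrete illustration of that check (not part of TestLogRolling; the regex and sample line are taken from the roll message above, and the class name is made up), here is a small self-contained snippet that parses a "Roll ..., entries=..., filesize=.... New hlog ..." line and confirms the writer really moved to a new file:

{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: verify from the log text that a roll switched hlog files.
public class RollLineCheck {
  private static final Pattern ROLL =
      Pattern.compile("Roll (\\S+), entries=\\d+, filesize=\\d+\\. New hlog (\\S+)");

  public static void main(String[] args) {
    String line = "Roll /user/root/.logs/C4C2.site,42700,1311136499029/"
        + "C4C2.site%3A42700.1311136559078, entries=3, filesize=1043."
        + " New hlog /user/root/.logs/C4C2.site,42700,1311136499029/"
        + "C4C2.site%3A42700.1311136622435";
    Matcher m = ROLL.matcher(line);
    if (m.find() && !m.group(1).equals(m.group(2))) {
      System.out.println("rolled: " + m.group(1) + " -> " + m.group(2));
    } else {
      System.out.println("no roll found in this line");
    }
  }
}
{code}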


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Fix Version/s: 0.90.4
           Status: Patch Available  (was: Reopened)


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079976#comment-13079976 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

bq. But the rollWriter was able to executed in time. So the following roll-requests were skippted.
I think you meant:
But the rollWriter wasn't able to execute in time. And the following roll-requests were skipped.


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081440#comment-13081440 ] 

stack commented on HBASE-4095:
------------------------------

I'd clean up the extra spaces the patch adds before commit.  It adds a bunch.

Reviewing it, I kept scratching my head.  This seems to be a slightly odd case in that it happens on a cluster of three nodes or less, yet we're changing the way a pretty important piece of code works.  Any chance of a unit test, Jieshan?  The HLog has a bunch of tests that do various mock-ups.  Shouldn't be too hard, studying what goes on there, to figure out how to get a little test in.  If you think this is too much, that's fine, we can commit since you got two +1s, but I thought I'd ask anyways.
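To make the scenario such a test would have to exercise concrete, here is a self-contained toy model of the starvation (plain JDK only, no HBase classes; all names are illustrative stand-ins, not the real HLog/LogRoller code): once the low-replication check has set its roll-requested flag, later checks become no-ops, so while the roller is parked on the flush lock the roll is deferred until the "flush" finishes.

{code}
import java.util.concurrent.locks.ReentrantLock;

// Toy model of the reported starvation; not HBase code.
public class RollStarvationDemo {
  private final ReentrantLock cacheFlushLock = new ReentrantLock(); // unfair, as in the report
  private volatile boolean logRollRequested = false;

  // Stands in for checkLowReplication(): only the first call actually requests a roll.
  void checkLowReplication() {
    if (!logRollRequested) {
      logRollRequested = true;
      new Thread(this::rollWriter, "logRoller").start(); // "request" the roll
    }
  }

  // Stands in for the roller: it cannot proceed until the flush lock is free.
  void rollWriter() {
    cacheFlushLock.lock();
    try {
      System.out.println("rolled at " + System.currentTimeMillis());
      logRollRequested = false;
    } finally {
      cacheFlushLock.unlock();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    RollStarvationDemo demo = new RollStarvationDemo();
    // A "cache flush" holds the lock for a long time...
    Thread flusher = new Thread(() -> {
      demo.cacheFlushLock.lock();
      try {
        Thread.sleep(5000);
      } catch (InterruptedException ignored) {
      } finally {
        demo.cacheFlushLock.unlock();
      }
    }, "cacheFlusher");
    flusher.start();
    Thread.sleep(100);              // let the flusher grab the lock first
    for (int i = 0; i < 10; i++) {  // every sync() keeps detecting low replication...
      demo.checkLowReplication();   // ...but only the first call does anything
      Thread.sleep(200);
    }
    flusher.join();
  }
}
{code}

Run as-is, it prints a single "rolled at ..." line roughly five seconds after start, even though the low-replication condition was observed ten times in that window.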


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085701#comment-13085701 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

Config parameters are renamed in HLog but not in the test; please make them consistent.
{code}
+        "hbase.regionserver.hlog.lowreplication.tolerable"
{code}
A better name would be hbase.regionserver.hlog.tolerable.lowreplication
{code}
+        "hbase.regionserver.hlog.lowreplication.rolls", 5);
{code}
Would hbase.regionserver.hlog.lowreplication.rolllimit be more informative?

In batchWriteAndWait(), remaining always decrements by 200. This can be made more accurate by calling System.currentTimeMillis().
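A minimal sketch of that timing change, assuming a generic wait loop (waitFor and the method name are illustrative stand-ins, not the actual batchWriteAndWait() code): compute the deadline once and derive the remaining wait from System.currentTimeMillis() instead of subtracting a fixed 200ms per iteration.

{code}
// Illustrative only: measure real elapsed time against a deadline rather than
// assuming each loop iteration took exactly 200ms.
static void waitUntil(java.util.function.BooleanSupplier waitFor, long timeoutMs)
    throws InterruptedException {
  final long deadline = System.currentTimeMillis() + timeoutMs;
  long remaining = timeoutMs;
  while (remaining > 0 && !waitFor.getAsBoolean()) {
    Thread.sleep(Math.min(200, remaining));             // cap each nap at 200ms
    remaining = deadline - System.currentTimeMillis();  // measure, don't decrement
  }
}
{code}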


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068314#comment-13068314 ] 

Hudson commented on HBASE-4095:
-------------------------------

Integrated in HBase-TRUNK #2044 (See [https://builds.apache.org/job/HBase-TRUNK/2044/])
    HBASE-4095 revert, wait for further investigation
HBASE-4095 revert to previous way of initializing initialReplication

tedyu : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java

tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java



[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HlogFileIsVeryLarge.gif


[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067436#comment-13067436 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

I think the following parameter:
{code}
+    this.tryLockWaitingTime = 
+        conf.getLong("hbase.regionserver.logroll.trylock.waitingtime", 5000);
{code}
can be named 'hbase.regionserver.cacheFlushLock.waittime'

{code}
+        this.logRollRequested = true;
+        try {
+          if (closed) {
+            return regionsToFlush;
{code}
Should this.logRollRequested = true be moved after the return?
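For readers following along, a sketch of the pattern under review (illustrative only; field and method names are stand-ins, not the actual HBASE-4095 patch): bound the wait on the cache-flush lock with tryLock so a roll request cannot block indefinitely, and, per the question above, check closed before touching the flag so a stale "requested" state cannot suppress later roll requests.

{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the tryLock pattern being discussed; not the real HLog.rollWriter().
class RollSketch {
  private final ReentrantLock cacheFlushLock = new ReentrantLock();
  private volatile boolean logRollRequested = false;
  private volatile boolean closed = false;

  /** @return true if a roll actually happened within the waiting time. */
  boolean rollWriter(long tryLockWaitingTimeMs) throws InterruptedException {
    // Wait a bounded time for the flush lock instead of blocking forever.
    if (!cacheFlushLock.tryLock(tryLockWaitingTimeMs, TimeUnit.MILLISECONDS)) {
      return false;                   // give up; a later request can try again
    }
    try {
      if (closed) {
        return false;                 // bail out before touching the flag
      }
      logRollRequested = true;        // set only once the roll will really run
      // ... close the current writer and open a new hlog here ...
      return true;
    } finally {
      logRollRequested = false;       // simplified: never leave a stale flag behind
      cacheFlushLock.unlock();
    }
  }
}
{code}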


[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBase-4095-V9-branch.patch)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBase-4095-V7-trunk.patch)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069399#comment-13069399 ] 

stack commented on HBASE-4095:
------------------------------

Oh, ok.  I was unable to follow what was going on; mixing up their order threw me off. (I'm still not sure what is going on here. Maybe if you can answer my question above, that would help -- thanks.)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-V3.patch, HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-V3.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V9-trunk.patch
                HBase-4095-V9-branch.patch

{quote}
Should the setting of this.logRollRunning be set instead in a finally block in here in rollWriter? If an exception thrown after we set the logRollRunning, it looks like logRollRunning could stay set. My guess is that not doing this would probably not be noticed in that we probably crash out the regionserver if a rollWriter fails but having the flag stuck set might make for some unexpected state?
{quote}
Yes, that's a good suggestion. It should be put in the finally block, and I've changed it.
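(Illustrative sketch only, not the actual patch: the finally-block pattern being discussed, using the logRollRunning flag named in the quote above; the rest of rollWriter() is only indicated.)
{code}
this.logRollRunning = true;
try {
  // ... roll the writer ...
} finally {
  // cleared even if the roll throws, so the flag cannot stay stuck set
  this.logRollRunning = false;
}
{code}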
{quote}
So, it looks like we'll roll 5 times by default before we'll turn off the low replication log rolling facility - which is better than a log per sync, right?
{quote}
Yes.
{quote}
Do you have a good reason for changing this behavior?
{quote}
Maybe we misunderstood this method:
{noformat}
 int nextInitialReplication = fs.getFileStatus(newPath).getReplication();
{noformat}
It always returns the default replication value, no matter how many datanodes are live or what the actual replica count is (I have run some tests to prove it, and I confirmed it with some HDFS experts).
{noformat}
this.minTolerableReplication = conf.getInt(
    "hbase.regionserver.hlog.tolerable.lowreplication",
    this.fs.getDefaultReplication());
{noformat}
So I added a new parameter, "hbase.regionserver.hlog.tolerable.lowreplication". Suppose the default replication value is 3. Before the patch, rollWriter was triggered as soon as the replica count decreased at all. To some extent that is unreasonable, because the remaining 2 replicas are sometimes still tolerable. So I made the threshold configurable.
That's why I changed the behavior.
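To make the intent concrete, here is a minimal, hypothetical sketch of a low-replication check driven by such a configurable threshold; it only reuses names already shown in this thread (getLogReplication(), requestLogRoll(), logRollRequested, minTolerableReplication) and is not the actual patch:
{code}
// Sketch only: request a roll when the live replica count drops below the
// configured tolerable minimum, instead of below the initial replication.
private void checkLowReplication() {
  try {
    int numCurrentReplicas = getLogReplication();
    if (numCurrentReplicas != 0
        && numCurrentReplicas < this.minTolerableReplication) {
      LOG.warn("Too few HLog replicas: found " + numCurrentReplicas
          + " but the tolerable minimum is " + this.minTolerableReplication
          + ". Requesting log roll.");
      requestLogRoll();
      this.logRollRequested = true;
    }
  } catch (Exception e) {
    LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e
        + " still proceeding ahead...");
  }
}
{code}
With the default taken from fs.getDefaultReplication(), behaviour only changes when an operator explicitly lowers the threshold. The patch also bounds how many consecutive low-replication rolls are attempted (five by default, per the earlier discussion) before the facility is switched off; that counter is omitted from this sketch.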

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HBase-4095-V9-branch.patch, HBase-4095-V9-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085558#comment-13085558 ] 

stack commented on HBASE-4095:
------------------------------

Thanks for doing the new test.  That helps.  I'm w/ J-D though that there are lots of timeouts.  Are they necessary?  The trouble with depending on timeouts is that they are not deterministic and may behave differently in different environments.  If there is no other way, then so be it... 

There is:

{code}
-    // write some more log data (this should use a new hdfs_out)
-    writeData(table, 3);
-    assertTrue("The log should not roll again.", log.getFilenum() == newFilenum);
{code}

Where you remove an assertion.  Can you leave that in?

What is the difference between batchWriteData and writeData?  The pause?  Is the pause needed?

Thanks

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068184#comment-13068184 ] 

stack commented on HBASE-4095:
------------------------------

It failed for me just now on my local machine.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBase-4095-V9-trunk.patch)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067466#comment-13067466 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

Running tests for TRUNK patch.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068185#comment-13068185 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

Only two places were changed in HLog#rollWriter():

1. lock -> tryLock
2. Change how the value of "this.initialReplication" is set:
{noformat}
// This method returns the expected, not the actual, replica count of the HLog file
int nextExpectReplicas = fs.getFileStatus(newPath).getReplication();

// Get the current (actual) replica count of the HLog file
int nextActualReplicas = -1;
try {
  nextActualReplicas = getLogReplication();
} catch (Exception e) {
  LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
      " still proceeding ahead...");
}

this.initialReplication = nextActualReplicas == -1 ?
    nextExpectReplicas : nextActualReplicas;
{noformat}
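For the first change (lock -> tryLock), here is a rough, hypothetical sketch of what a bounded lock attempt in rollWriter() could look like. It assumes cacheFlushLock is a java.util.concurrent.locks.ReentrantLock and reuses the tryLockWaitingTime field backed by "hbase.regionserver.logroll.trylock.waitingtime" mentioned earlier in this thread; the actual patch may differ:
{code}
// Sketch only, not the committed patch: give up this roll attempt if the
// cache-flush lock cannot be acquired within the configured waiting time,
// instead of blocking behind cache flushes indefinitely.
boolean locked = false;
try {
  locked = this.cacheFlushLock.tryLock(this.tryLockWaitingTime,
      java.util.concurrent.TimeUnit.MILLISECONDS);
} catch (InterruptedException ie) {
  Thread.currentThread().interrupt();
}
if (!locked) {
  LOG.debug("Could not acquire cacheFlushLock within " + this.tryLockWaitingTime
      + "ms, leaving the roll request pending");
  return null; // nothing was rolled this time
}
try {
  // ... existing rollWriter() body ...
} finally {
  this.cacheFlushLock.unlock();
}
{code}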



> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079996#comment-13079996 ] 

ramkrishna.s.vasudevan commented on HBASE-4095:
-----------------------------------------------

@Jieshan,
I was analysing the initial logs attached.  As you said, the rolling of the HLog did not happen for some reason; I suppose the HLog roller was not able to acquire the cacheFlushLock.  If the rolling of logs did not happen for this reason, will the current solution take care of it?
One more point: if the number of replicas does not recover in time for the log roll to happen, then we stop the log rolling as per the current patch. Correct me if I am wrong.  
Is this behaviour ok? 
Should we expose the new properties in hbase-default.xml?
Please correct me if my analysis/query is wrong.  



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085569#comment-13085569 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

I used a timeout to wait for the logRoll request; otherwise we can't get the correct value from the call to "log.isFewerReplicasRollEnabled()". A lot of sleeping does indeed seem unreasonable. I'm thinking about J-d's suggestion of using mockito (though I don't know whether it will help).
About the assertion, I'll put it back.
batchWriterData writes a list of values, while writeData can only put one record. The problem in this issue happens in the scenario where a lot of data is written into the table, so I used batchWriterData instead of writeData.
Thanks.
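
For illustration, here is a minimal sketch of the batch-write idea described above. The helper name, signature, and row layout are assumptions for the example, not the actual test code in the patch:

{noformat}
// Sketch only: write many rows in one client call to drive heavy WAL
// traffic, instead of one Put per call. Names and signature are assumptions.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchWriteSketch {
  public static void batchWriterData(HTable table, byte[] family,
      byte[] qualifier, int count) throws Exception {
    List<Put> puts = new ArrayList<Put>(count);
    for (int i = 0; i < count; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(family, qualifier, Bytes.toBytes("value-" + i));
      puts.add(put);
    }
    // One batched call; each edit still goes through the WAL, which is
    // what makes the low-replication roll problem visible in the test.
    table.put(puts);
  }
}
{noformat}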


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: surefire-report-V5-trunk.html
                HBase-4095-V5-trunk.patch


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068285#comment-13068285 ] 

Hudson commented on HBASE-4095:
-------------------------------

Integrated in HBase-TRUNK #2043 (See [https://builds.apache.org/job/HBase-TRUNK/2043/])
    HBASE-4095 make cacheFlushLock.waittime longer

tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069470#comment-13069470 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

stack, you can ignore this patch.
I'll put up a more thorough patch.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V6-trunk.patch
                HBase-4095-V6-branch.patch

Just modified the code comments for the V6 patches.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Status: Patch Available  (was: Open)


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V8-trunk.patch
                HBase-4095-V8-branch.patch

I fixed some spellings and sentences according to J-d's suggestion. About the sleeps, I have made some changes: I added a timeout mechanism to decrease the sleeping time.
But some sleeps are still required, like the following:
{noformat}
 assertTrue(dfsCluster.stopDataNode(pipeline[1].getName()) != null);
 Thread.sleep(10000);
{noformat}
Maybe there is some difference depending on the environment, but code like this is already there; 10000 is probably an empirical value.
I tried to use mockito, but it makes things more difficult in this scenario.
{quote}
This is not how we do brackets:
+  public boolean isFewerReplicasRollEnabled()
+  {
+    return fewerReplicasRollEnabled;
+  }
{quote}
I just got the Eclipse-based Apache Formatter xml and applied it in my editor. Before this, I formatted the code manually :(. So this problem will not happen again.
{quote}
Where you remove an assertion. Can you leave that in?
{quote}
I've put them back.

stack, J-d, expecting your review again. Thanks a lot.
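
For illustration only, here is a minimal sketch of the kind of bounded wait that can replace part of a blind sleep; the helper class and its Condition callback are hypothetical names, not code from the patch:

{noformat}
// Hypothetical bounded-wait helper: poll a condition until it holds or a
// deadline passes, instead of sleeping a fixed 10 seconds. Sketch only.
public final class WaitSketch {

  /** Hypothetical callback; not part of HBase. */
  public interface Condition {
    boolean eval() throws Exception;
  }

  /** Returns true if the condition became true within timeoutMs. */
  public static boolean waitFor(long timeoutMs, long pollMs, Condition cond)
      throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (cond.eval()) {
        return true;
      }
      Thread.sleep(pollMs); // short poll instead of one long blind sleep
    }
    return cond.eval();     // one final check at the deadline
  }
}
{noformat}

A test could, for example, poll the log's replica count, or a flag such as log.isFewerReplicasRollEnabled(), with such a helper and only spend the full ten seconds when the condition never becomes true.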


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065757#comment-13065757 ] 

Jieshan Bean commented on HBASE-4095:
-------------------------------------

The modifications in the patch:
1. Move "logRollRequested = true;" out of "HLog#checkLowReplication()" and set it in "HLog#rollWriter" only after the cacheFlushLock has been acquired.
2. Change "this.cacheFlushLock.lock()" to "this.cacheFlushLock.tryLock" in "HLog#rollWriter()"; if acquiring the lock is blocked for a long time, just skip the roll. I think this is reasonable.
3. "fs.getFileStatus(newPath).getReplication();" returns the expected replication, not the actual one. "HLog.initialReplication" is updated on each call to rollWriter, so I think its value should be the current number of replicas rather than the expected value.

I have run many tests on the patch and it behaves very well. In each test I killed one of the datanodes carrying a replica of the HLog, and a new HLog file was recreated successfully.
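
As an illustration of points 1 and 2 above, here is a minimal sketch of the tryLock-with-timeout pattern; the class shape, field handling, and the one-second timeout are assumptions, not the exact patch code:

{noformat}
// Sketch only: skip the roll if the flush lock cannot be acquired within a
// bounded time, and only manipulate logRollRequested while the lock is held.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class RollWithTryLockSketch {
  private final ReentrantLock cacheFlushLock = new ReentrantLock();
  private volatile boolean logRollRequested = false;

  public void rollWriter() throws InterruptedException {
    // Bounded wait instead of blocking indefinitely behind cache flushes.
    if (!cacheFlushLock.tryLock(1, TimeUnit.SECONDS)) {
      return; // could not get the lock in time: skip this roll attempt
    }
    try {
      // Per point 1, the flag is set only once the lock is actually held.
      logRollRequested = true;
      // ... close the current writer and open a new hlog here ...
      logRollRequested = false; // cleared only after a roll really happened
    } finally {
      cacheFlushLock.unlock();
    }
  }
}
{noformat}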


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067903#comment-13067903 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

Although build 2041 passed, we should investigate why TestLogRolling failed intermittently.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067504#comment-13067504 ] 

Ted Yu commented on HBASE-4095:
-------------------------------

Integrated to branch and TRUNK.

Thanks for the patch Jieshan.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif
>
>
> Some very large Hlog files (larger than 10 GB) appeared in our environment, and I found the reason why they grew so huge:
> 1. The replica count is less than the expected number, so checkLowReplication is triggered on every sync.
> 2. checkLowReplication first requests a log roll and sets logRollRequested to true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3. requestLogRoll() only submits the roll request. The roll may not execute in time, because it must acquire the unfair cacheFlushLock,
> which may be held by the cache-flush threads.
> 4. logRollRequested stays true until the roll actually executes, so during that time every log-roll request from sync() is skipped.
> Here are the logs from when the problem happened (note the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081424#comment-13081424 ] 

jiraposter@reviews.apache.org commented on HBASE-4095:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1354/
-----------------------------------------------------------

(Updated 2011-08-09 04:20:46.103510)


Review request for hbase.


Summary
-------

Issue: https://issues.apache.org/jira/browse/HBASE-4095

Some very large Hlog files (larger than 10 GB) appeared in our environment, and I found the reason why they grew so huge:

1. The replica count is less than the expected number, so checkLowReplication is triggered on every sync.
2. checkLowReplication first requests a log roll and sets logRollRequested to true.
3. requestLogRoll() only submits the roll request; it may not execute in time, as in the following scenario (rollWriter() returns at the very beginning because numEntries.get() == 0):
    public byte [][] rollWriter() throws FailedLogCloseException, IOException {
    // Return if nothing to flush.
    if (this.writer != null && this.numEntries.get() <= 0) {
      return null;
    }
4. logRollRequested stays true until the roll actually executes, so during that time every log-roll request from sync() is skipped.


This addresses bug HBASE-4095.
    https://issues.apache.org/jira/browse/HBASE-4095


Diffs
-----

  /src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 1155186 

Diff: https://reviews.apache.org/r/1354/diff


Testing
-------

Please find the test results in the attachments of https://issues.apache.org/jira/browse/HBASE-4095


Thanks,

Jieshan



> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html
>
>
> Some very large Hlog files (larger than 10 GB) appeared in our environment, and I found the reason why they grew so huge:
> 1. The replica count is less than the expected number, so checkLowReplication is triggered on every sync.
> 2. checkLowReplication first requests a log roll and sets logRollRequested to true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3. requestLogRoll() only submits the roll request. The roll may not execute in time, because it must acquire the unfair cacheFlushLock,
> which may be held by the cache-flush threads.
> 4. logRollRequested stays true until the roll actually executes, so during that time every log-roll request from sync() is skipped.
> Here are the logs from when the problem happened (note the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: RelatedLogs2011-07-28.txt

After running more tests on this issue, I found that the cause "rollWriter blocked for a long time" is just an illusion.
(In order to trace what happened, I added some new logs; please check the attached file "RelatedLogs2011-07-28".)
Here's the analysis:
(Suppose there are only 3 nodes in the cluster.)

1. An HLog roll was requested after one datanode was killed.
2. HLog#logRollRequested was set to true after the request.
3. HLog#rollWriter was executed some time later.
4. A new rollWriter was requested right after the previous one, and there were no new entries in between, so
   HLog#rollWriter() returned at the very beginning, because numEntries.get() == 0:
{noformat}
  public byte [][] rollWriter() throws FailedLogCloseException, IOException {
    // Return if nothing to flush.
    long stamp = System.currentTimeMillis();
    if (this.writer != null && this.numEntries.get() <= 0) {
      return null;
    }
    byte [][] regionsToFlush = null;
{noformat}

It happens with high probability if one datanode is down in such a mini-cluster (3 nodes):

1. HLog#initialReplication is reset in each HLog#rollWriter:
{noformat}
  int nextInitialReplication = fs.getFileStatus(newPath).getReplication();
  this.initialReplication = nextInitialReplication;
{noformat}
The method of "fs.getFileStatus(newPath).getReplication()" could only get the default/expect replicas value(3).
So HLog#initialReplication is always 3.

2. The HLog can only have 2 replicas, because only 2 datanodes are left,
   so rollWriter is requested on every HLog#sync().

That's why this happens so easily when one datanode is down.
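
A small illustration of why that value never drops (editor's sketch, not HBase code): FileStatus.getReplication() is the configured replication factor stored in NameNode metadata, not the size of the live write pipeline.

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConfiguredReplicationExample {
  // Returns the file's configured replication factor from NameNode metadata.
  // This number does not shrink when a datanode dies, so on a degraded 3-node
  // cluster it still reads 3 while the live pipeline (the value HLog reads via
  // reflection on DFSOutputStream.getNumCurrentReplicas()) reports only 2.
  static short configuredReplication(FileSystem fs, Path hlogPath) throws IOException {
    return fs.getFileStatus(hlogPath).getReplication();
  }
}
{noformat}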

I'm thinking about two solutions to this issue:

[Solution A]

Consider the actual live datanode count when resetting the value of HLog#initialReplication.
If the live datanode count is less than the default replication, use the actual datanode count; otherwise, set it to the default value.
But I have no idea how to get the live datanode count.
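
One possible starting point for that open question (editor's sketch; it assumes the FileSystem is HDFS, and note that this report includes dead datanodes too, so it is not yet a true live count):

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DataNodeCountSketch {
  // Asks the NameNode for its datanode report. This returns every datanode the
  // NameNode knows about, including dead ones, so a real fix would still need
  // a way to separate live nodes from dead ones.
  static int knownDataNodeCount(FileSystem fs) throws IOException {
    if (!(fs instanceof DistributedFileSystem)) {
      return -1;  // e.g. LocalFileSystem in unit tests
    }
    DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
    return nodes == null ? 0 : nodes.length;
  }
}
{noformat}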

[Solution B]

Add a new configurable parameter for the minimum tolerable replica count. Its default value is the default replication.
If a datanode carrying one of the HLog's replicas is killed, but the actual replica count is still greater than or equal to this configured value, we should not trigger a new rollWriter.
Something like:
{noformat}
   private OutputStream hdfs_out;     // OutputStream associated with the current SequenceFile.writer
-  private int initialReplication;    // initial replication factor of SequenceFile.writer
+  
+  //Minimum tolerable replicas, if the actual value is less than that, the rollWriter would be trickered
+  private int minTolerableReplication; 
+  
   private Method getNumCurrentReplicas; // refers to DFSOutputStream.getNumCurrentReplicas
   final static Object [] NO_ARGS = new Object []{};
 
@@ -358,6 +361,9 @@
       }
     }
     this.maxLogs = conf.getInt("hbase.regionserver.maxlogs", 32);
+    
+    this.minTolerableReplication = 
+      conf.getInt("hbase.regionserver.hlog.mintolerablereplicas", 3);
     this.enabled = conf.getBoolean("hbase.regionserver.hlog.enabled", true);
     LOG.info("HLog configuration: blocksize=" +
       StringUtils.byteDesc(this.blocksize) +
@@ -480,7 +486,6 @@
       this.filenum = System.currentTimeMillis();
       Path newPath = computeFilename();
       HLog.Writer nextWriter = this.createWriterInstance(fs, newPath, conf);
-      int nextInitialReplication = fs.getFileStatus(newPath).getReplication();
       // Can we get at the dfsclient outputstream?  If an instance of
       // SFLW, it'll have done the necessary reflection to get at the
       // protected field name.
@@ -500,7 +505,6 @@
         // Clean up current writer.
         Path oldFile = cleanupCurrentWriter(currentFilenum);
         this.writer = nextWriter;
-        this.initialReplication = nextInitialReplication;
         this.hdfs_out = nextHdfsOut;
 
         LOG.info((oldFile != null?
@@ -1003,10 +1007,10 @@
     try {
       int numCurrentReplicas = getLogReplication();
       if (numCurrentReplicas != 0 &&
-          numCurrentReplicas < this.initialReplication) {
+          numCurrentReplicas < this.minTolerableReplication) {
         LOG.warn("HDFS pipeline error detected. " +
-            "Found " + numCurrentReplicas + " replicas but expecting " +
-            this.initialReplication + " replicas. " +
+            "Found " + numCurrentReplicas + " replicas but expecting no less than " +
+            this.minTolerableReplication + " replicas. " +
             " Requesting close of hlog.");
         requestLogRoll();
         logRollRequested = true;
{noformat}
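
With a patch along these lines applied, tuning it would presumably look like the following (editor's sketch; the property name and its default of 3 are taken from the diff above):

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MinTolerableReplicasConfigExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // On a 3-node cluster that should ride over a single datanode failure
    // without endless roll requests, lower the floor from the default of 3.
    conf.setInt("hbase.regionserver.hlog.mintolerablereplicas", 2);
    System.out.println(conf.getInt("hbase.regionserver.hlog.mintolerablereplicas", 3));
  }
}
{noformat}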


> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, RelatedLogs2011-07-28.txt, hbase-root-regionserver-193-195-5-111.rar, surefire-report-branch.html
>
>
> Some very large Hlog files (larger than 10 GB) appeared in our environment, and I found the reason why they grew so huge:
> 1. The replica count is less than the expected number, so checkLowReplication is triggered on every sync.
> 2. checkLowReplication first requests a log roll and sets logRollRequested to true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3. requestLogRoll() only submits the roll request. The roll may not execute in time, because it must acquire the unfair cacheFlushLock,
> which may be held by the cache-flush threads.
> 4. logRollRequested stays true until the roll actually executes, so during that time every log-roll request from sync() is skipped.
> Here are the logs from when the problem happened (note the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085458#comment-13085458 ] 

Jean-Daniel Cryans commented on HBASE-4095:
-------------------------------------------

mintolerablereplicas and fewerreplicasrolllimit should be made more readable.

The log message "it's a sign of the live datanodes count is less than the minimum tolerable replicas" should be rewritten.

Fix spelling:

{quote}
+    // Force roll writer. The new log file will have the defalt replication,
+    // and the lowerReplcas roller will be enabled.
{quote}

This is not how we do brackets:

{code}
+  public boolean isFewerReplicasRollEnabled()
+  {
+    return fewerReplicasRollEnabled;
+  }
{code}
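
For reference, the same accessor in the brace style used elsewhere in the code base (editor's restyling of the snippet above):

{code}
+  public boolean isFewerReplicasRollEnabled() {
+    return fewerReplicasRollEnabled;
+  }
{code}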

About the test, there's a _lot_ of sleeping in there, which makes the test suite run even slower than it already is. Any chance of making it not rely on timing (which is usually flaky) by using tools like Mockito?
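
A rough illustration of that suggestion (editor's sketch, not from any attached patch; it assumes the replica count read by checkLowReplication(), i.e. getLogReplication(), is visible enough for Mockito to stub, and it omits the assertion on whichever roll-request hook the patch exposes):

{noformat}
import static org.mockito.Mockito.doReturn;
import static org.mockito.Mockito.spy;

import org.apache.hadoop.hbase.regionserver.wal.HLog;

// Inside a test method: instead of killing a datanode and sleeping until the
// pipeline shrinks, stub the replica count that checkLowReplication() sees.
HLog log = spy(realLog);                     // realLog built by the test as usual
doReturn(2).when(log).getLogReplication();   // simulate one lost replica
log.sync();                                  // should request a roll immediately, no sleeps
{noformat}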

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html
>
>
> Some very large Hlog files (larger than 10 GB) appeared in our environment, and I found the reason why they grew so huge:
> 1. The replica count is less than the expected number, so checkLowReplication is triggered on every sync.
> 2. checkLowReplication first requests a log roll and sets logRollRequested to true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3. requestLogRoll() only submits the roll request. The roll may not execute in time, because it must acquire the unfair cacheFlushLock,
> which may be held by the cache-flush threads.
> 4. logRollRequested stays true until the roll actually executes, so during that time every log-roll request from sync() is skipped.
> Here are the logs from when the problem happened (note the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment:     (was: HBase-4095-V9-trunk.patch)

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HBase-4095-V7-branch.patch, HBase-4095-V7-trunk.patch, HBase-4095-V8-branch.patch, HBase-4095-V8-trunk.patch, HlogFileIsVeryLarge.gif, LatestLLTResults-20110810.rar, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, flowChart-IntroductionToThePatch.gif, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html
>
>
> Some very large Hlog files (larger than 10 GB) appeared in our environment, and I found the reason why they grew so huge:
> 1. The replica count is less than the expected number, so checkLowReplication is triggered on every sync.
> 2. checkLowReplication first requests a log roll and sets logRollRequested to true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3. requestLogRoll() only submits the roll request. The roll may not execute in time, because it must acquire the unfair cacheFlushLock,
> which may be held by the cache-flush threads.
> 4. logRollRequested stays true until the roll actually executes, so during that time every log-roll request from sync() is skipped.
> Here are the logs from when the problem happened (note the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira