Posted to issues@hbase.apache.org by "jack levin (Created) (JIRA)" <ji...@apache.org> on 2011/10/28 22:29:32 UTC
[jira] [Created] (HBASE-4695) WAL logs get deleted before region
server can fully flush
WAL logs get deleted before region server can fully flush
---------------------------------------------------------
Key: HBASE-4695
URL: https://issues.apache.org/jira/browse/HBASE-4695
Project: HBase
Issue Type: Bug
Components: wal
Affects Versions: 0.90.4
Reporter: jack levin
Priority: Blocker
Fix For: 0.90.5
To replicate the problem, do the following:
1. Check the /hbase/.logs/XXXX directory to confirm that WAL logs exist for the region server you are shutting down.
2. Execute kill <pid> (where pid is the regionserver pid).
3. Watch the regionserver log as it starts flushing; it reports how many regions are left to close:
09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 489 regions to close
09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 116 regions to close
4. Check /hbase/.logs/XXXX again -- you will notice that it has disappeared.
5. Check the namenode logs:
09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root ip=/10.101.1.5 cmd=delete src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749
Note that if you kill -9 the RS now and it crashes mid-flush, you won't have any WAL logs to replay. We need to make sure that logs are deleted or moved out only when the RS has fully flushed. Otherwise it's possible to lose data.
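To make the data-loss window concrete, here is a toy model of the two shutdown orderings. It is only a sketch: RegionServerSim, shutdown and make_server are hypothetical names invented for illustration and are not HBase classes, and the model assumes every unflushed edit is also in the WAL.

```python
from dataclasses import dataclass, field


@dataclass
class RegionServerSim:
    """Toy model (hypothetical, not HBase code): unflushed edits also live in the WAL."""
    memstore: dict = field(default_factory=dict)     # region -> unflushed edits
    wal: list = field(default_factory=list)          # edits still replayable
    store_files: list = field(default_factory=list)  # edits persisted by a flush

    def put(self, region, edit):
        self.memstore.setdefault(region, []).append(edit)
        self.wal.append(edit)

    def flush(self, region):
        self.store_files.extend(self.memstore.pop(region, []))

    def remove_wal(self):
        self.wal.clear()

    def recoverable(self):
        """Edits that survive a crash right now: flushed data plus WAL replay."""
        return set(self.store_files) | set(self.wal)


def shutdown(rs, wal_first, crash_after_flushes=None):
    """Run a shutdown; optionally simulate kill -9 after N region flushes."""
    if wal_first:
        rs.remove_wal()                  # ordering reported in this issue
    for flushed, region in enumerate(list(rs.memstore)):
        if crash_after_flushes is not None and flushed >= crash_after_flushes:
            return rs.recoverable()      # crashed in the middle of the shutdown
        rs.flush(region)
    if not wal_first:
        rs.remove_wal()                  # ordering this issue asks for
    return rs.recoverable()


def make_server():
    rs = RegionServerSim()
    for region, edit in [("r1", "a"), ("r2", "b"), ("r3", "c")]:
        rs.put(region, edit)
    return rs


all_edits = {"a", "b", "c"}
lost_buggy = all_edits - shutdown(make_server(), wal_first=True, crash_after_flushes=1)
lost_fixed = all_edits - shutdown(make_server(), wal_first=False, crash_after_flushes=1)
print(sorted(lost_buggy), sorted(lost_fixed))
```

With the WAL removed first, a crash after one flush loses the edits of the two regions that had not flushed yet; with the WAL removed last, the same crash loses nothing because the edits can still be replayed.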
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "gaojinchao (Assigned) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
gaojinchao reassigned HBASE-4695:
---------------------------------
Assignee: gaojinchao
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
>
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "Ted Yu (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Yu updated HBASE-4695:
--------------------------
Attachment: hbase-4695-0.92.txt
Proposed patch for 0.92
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140299#comment-13140299 ]
stack commented on HBASE-4695:
------------------------------
@Jack You want to try this patch?
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "jack levin (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140713#comment-13140713 ]
jack levin commented on HBASE-4695:
-----------------------------------
It would be nice to have a patch for 0.90.4 also.
Thanks,
-Jack
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139986#comment-13139986 ]
gaojinchao commented on HBASE-4695:
-----------------------------------
Latest trunk version; the test passed on a real cluster:
Region Server logs:
2011-10-31 03:32:42,922 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server C3S31,20020,1320034091400
2011-10-31 03:32:46,974 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server C3S31,20020,1320034091400; all regions closed.
2011-10-31 03:32:48,633 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Moved 7 log files to /hbase/.oldlogs
2011-10-31 03:32:49,200 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server C3S31,20020,1320034091400; zookeeper connection closed.
Namenode logs:
2011-10-31 03:32:46,988 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(192)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=listStatus src=/hbase/.logs/C3S31,20020,1320034091400 perm=root:supergroup:rwxr-xr-x
2011-10-31 03:32:46,991 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=rename src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320045179340 dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320045179340 perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,992 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=rename src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046155808 dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046155808 perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,994 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=rename src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046186294 dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046186294 perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,996 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=rename src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046216288 dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046216288 perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,998 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=rename src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046255166 dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046255166 perm=root:supergroup:rw-r--r--
2011-10-31 03:32:47,206 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(192)) - ugi=webuser,webgroup ip=/158.1.130.33 cmd=listStatus src=/hbase/.logs/C3S31,20020,1320034091400 perm=root:supergroup:rwxr-xr-x
2011-10-31 03:32:48,518 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=rename src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046295501 dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046295501 perm=root:supergroup:rw-r--r--
2011-10-31 03:32:48,633 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=rename src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046325013 dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046325013 perm=root:supergroup:rw-r--r--
2011-10-31 03:32:48,650 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(206)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=delete src=/hbase/.logs/C3S31,20020,1320034091400
2011-10-31 03:32:49,389 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(206)) - ugi=root,root,sfcb ip=/158.1.130.32 cmd=delete src=/hbase/.META./1028785192/.tmp
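The comparison done by eye above (region server log against namenode audit log) can be sketched as a small check. This is an illustration only: time_of_day and wal_dir_deleted_too_early are hypothetical helper names, and the check assumes both logs cover the same day and that the hosts' clocks are in sync. The sample lines are copied from this thread.

```python
import re

TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def time_of_day(line):
    """Extract HH:MM:SS,mmm from a log line as a comparable tuple of ints."""
    m = TIME.search(line)
    if m is None:
        raise ValueError(f"no timestamp in: {line!r}")
    return tuple(int(part) for part in m.groups())


def wal_dir_deleted_too_early(rs_lines, nn_lines, log_dir):
    """True if the namenode saw log_dir deleted before the region server's last
    region-close report (assumes same day and synchronized clocks)."""
    deletes = [time_of_day(l) for l in nn_lines
               if "cmd=delete" in l and f"src={log_dir}" in l]
    closing = [time_of_day(l) for l in rs_lines
               if "regions to close" in l or "all regions closed" in l]
    if not deletes or not closing:
        return False  # not enough evidence in these excerpts
    return min(deletes) < max(closing)


# Lines from the original report (the src path is truncated in the report itself).
reported_rs = [
    "09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 489 regions to close",
    "09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 116 regions to close",
]
reported_nn = [
    "09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root ip=/10.101.1.5 cmd=delete src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749",
]

# Lines from the patched cluster above.
patched_rs = [
    "2011-10-31 03:32:46,974 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server C3S31,20020,1320034091400; all regions closed.",
]
patched_nn = [
    "2011-10-31 03:32:48,650 INFO FSNamesystem.audit (FSNamesystem.java:logAuditEvent(206)) - ugi=root,root,sfcb ip=/158.1.130.31 cmd=delete src=/hbase/.logs/C3S31,20020,1320034091400",
]

before = wal_dir_deleted_too_early(
    reported_rs, reported_nn, "/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749")
after = wal_dir_deleted_too_early(
    patched_rs, patched_nn, "/hbase/.logs/C3S31,20020,1320034091400")
print(before, after)
```

On the reported logs the delete precedes the region-close messages, so the check flags it; on the patched cluster the delete follows "all regions closed", so it does not.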
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-4695:
-------------------------
Resolution: Fixed
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)
Applied to branches and trunk. Thanks for the patch, Gao.
I looked at this for a while to see if I could make a test. It'd be a timing-based thing where we'd check that logs were not moved before the flush had completed. Punting.
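For what it's worth, one way such a check could avoid depending purely on timing is to hold the flush open with a gate the test controls. The sketch below only illustrates that idea against a stand-in object; FakeRegionServer and its methods are hypothetical and are not the HRegionServer API, and no such test was added as part of this issue.

```python
import threading


class FakeRegionServer:
    """Hypothetical stand-in for a region server; not the HBase API."""

    def __init__(self, flush_gate):
        self.flush_gate = flush_gate            # test decides when the flush may finish
        self.flush_started = threading.Event()
        self.events = []                        # ordered record of what happened

    def close_regions(self):
        self.flush_started.set()
        self.flush_gate.wait()                  # flush stays "in progress" until released
        self.events.append("flushed")

    def archive_wal(self):
        self.events.append("wal_moved")

    def stop(self):
        # Ordering under test: close (flush) every region, then move the WAL.
        self.close_regions()
        self.archive_wal()


gate = threading.Event()
server = FakeRegionServer(gate)
worker = threading.Thread(target=server.stop)
worker.start()

server.flush_started.wait(timeout=5)
moved_during_flush = "wal_moved" in server.events  # sampled while the flush is blocked
gate.set()                                         # let the flush complete
worker.join(timeout=5)

print(moved_during_flush, server.events)
```

Because the flush cannot finish until the gate is released, the assertion that the WAL has not moved is made at a known point instead of after a sleep.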
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140960#comment-13140960 ]
stack commented on HBASE-4695:
------------------------------
@Jack Thanks boss for taking it for a spin.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "gaojinchao (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
gaojinchao updated HBASE-4695:
------------------------------
Attachment: HBASE-4695_Trunk_V2.patch
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141754#comment-13141754 ]
Hudson commented on HBASE-4695:
-------------------------------
Integrated in HBase-TRUNK #2397 (See [https://builds.apache.org/job/HBase-TRUNK/2397/])
HBASE-4695 WAL logs get deleted before region server can fully flush
stack :
Files :
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139548#comment-13139548 ]
Hadoop QA commented on HBASE-4695:
----------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12501494/hbase-4695-0.92.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
-1 javadoc. The javadoc tool appears to have generated -166 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.master.TestDistributedLogSplitting
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/103//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/103//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/103//console
This message is automatically generated.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141822#comment-13141822 ]
Hudson commented on HBASE-4695:
-------------------------------
Integrated in HBase-0.92 #95 (See [https://builds.apache.org/job/HBase-0.92/95/])
HBASE-4695 WAL logs get deleted before region server can fully flush
stack :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "gaojinchao (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
gaojinchao updated HBASE-4695:
------------------------------
Attachment: HBASE-4695_branch90_trial.patch
I will verify this patch when I am back at the company.
If you have time, please review it first.
The patch seems simple.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "jack levin (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140918#comment-13140918 ]
jack levin commented on HBASE-4695:
-----------------------------------
I confirmed that this fix is working on my cluster. When I shut down
the region server, the logs are moved to .oldlogs only after the RS
has fully shut down.
-Jack
On Mon, Oct 31, 2011 at 7:20 PM, Hadoop QA (Commented) (JIRA)
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-4695:
-------------------------
Fix Version/s: 0.92.0
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140009#comment-13140009 ]
gaojinchao commented on HBASE-4695:
-----------------------------------
@Ted
Our lab network is down, so I will verify the newest patch tomorrow.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140297#comment-13140297 ]
stack commented on HBASE-4695:
------------------------------
+1 on patch. Looks good to me.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139532#comment-13139532 ]
Ted Yu commented on HBASE-4695:
-------------------------------
Originally closeWAL(true) would be called even if this.fsOk is false. But now:
{code}
+ if (this.fsOk) {
waitOnAllRegionsToClose(abortRequested);
+ if (!this.killed){
+ closeWAL(abortRequested ? false : true);
{code}
I think the call to closeWAL() should be placed outside the if (this.fsOk) block.
The next step is to verify that the issue is really fixed.
Thanks for taking care of this, Jinchao.
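The ordering under discussion can be sketched in isolation. This is only an illustration under stated assumptions: ShutdownOrderSketch is a hypothetical stand-in, not the real HRegionServer; the field and method names are borrowed from the snippet above; and keeping the killed check around closeWAL() once it moves out of the fsOk block is an assumption, since the comment does not say.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the shutdown path; it only records the order of calls. */
final class ShutdownOrderSketch {
    private final List<String> events = new ArrayList<>();
    private final boolean fsOk;
    private final boolean killed;
    private final boolean abortRequested;

    ShutdownOrderSketch(boolean fsOk, boolean killed, boolean abortRequested) {
        this.fsOk = fsOk;
        this.killed = killed;
        this.abortRequested = abortRequested;
    }

    // Stand-in: the real method would block until every region has closed.
    private void waitOnAllRegionsToClose(boolean abort) {
        events.add("regionsClosed");
    }

    // Stand-in: the boolean is taken here to mean "also delete the log files".
    private void closeWAL(boolean delete) {
        events.add(delete ? "walClosedAndDeleted" : "walClosed");
    }

    /** Suggested shape: region close is guarded by fsOk, the WAL close is not. */
    List<String> shutdown() {
        if (fsOk) {
            waitOnAllRegionsToClose(abortRequested);
        }
        if (!killed) { // assumption: the killed guard stays with closeWAL()
            closeWAL(!abortRequested);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(new ShutdownOrderSketch(true, false, false).shutdown());
        System.out.println(new ShutdownOrderSketch(false, false, false).shutdown());
    }
}
```

In this sketch the WAL is never closed before the region wait returns. With fsOk set to false, the WAL is still closed with the delete flag set, without any region wait.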
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139987#comment-13139987 ]
gaojinchao commented on HBASE-4695:
-----------------------------------
@Ted
I think your patch has a problem: when this.fsOk is false, we would still clean up the HLog, and the data would be lost.
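One way to read this concern is as a single predicate on when the log files may be removed. The sketch below is an interpretation, not the patch itself: WalDeletePolicy and mayDeleteWal are hypothetical names, and the regionsStillOpen parameter is an added assumption that mirrors the issue title.

```java
/** Hypothetical policy helper, not HBase code: when is it safe to delete WAL files? */
final class WalDeletePolicy {
    /**
     * Deleting is treated as safe only if the filesystem is healthy, the server
     * was not killed, no abort was requested, and every region has closed
     * (so there are no unflushed edits left that a replay would need).
     */
    static boolean mayDeleteWal(boolean fsOk, boolean killed,
                                boolean abortRequested, int regionsStillOpen) {
        return fsOk && !killed && !abortRequested && regionsStillOpen == 0;
    }

    public static void main(String[] args) {
        // Filesystem unhealthy: keep the logs so they can be replayed later.
        System.out.println(mayDeleteWal(false, false, false, 0));
        // Clean stop with all regions closed: the logs are no longer needed.
        System.out.println(mayDeleteWal(true, false, false, 0));
    }
}
```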
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139553#comment-13139553 ]
gaojinchao commented on HBASE-4695:
-----------------------------------
@Ted
Thanks for your review.
My reasoning was that if this.fsOk is false, calling closeWAL() makes no sense.
Your approach matches the original logic, which is OK. I will verify it next Monday.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140847#comment-13140847 ]
gaojinchao commented on HBASE-4695:
-----------------------------------
@jack
I made a patch for branch90. It contains a small change that you can merge into your version.
Our lab network is also down, so I can't verify this myself.
Sorry.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139630#comment-13139630 ]
Ted Yu commented on HBASE-4695:
-------------------------------
The test failures reported by HadoopQA were due to 'Too many open files'
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140044#comment-13140044 ]
Hadoop QA commented on HBASE-4695:
----------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12501582/HBASE-4695_Trunk_V2.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
-1 javadoc. The javadoc tool appears to have generated -166 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.master.TestDistributedLogSplitting
org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
org.apache.hadoop.hbase.master.TestMasterFailover
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/110//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/110//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/110//console
This message is automatically generated.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "gaojinchao (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
gaojinchao updated HBASE-4695:
------------------------------
Attachment: HBASE-4695_Branch90_V2.patch
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Commented] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140852#comment-13140852 ]
Hadoop QA commented on HBASE-4695:
----------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12501720/HBASE-4695_Branch90_V2.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/116//console
This message is automatically generated.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.92.0, 0.90.5
>
> Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region
server can fully flush
Posted by "Ted Yu (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Yu updated HBASE-4695:
--------------------------
Status: Patch Available (was: Open)
Submit for patch testing.
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
> Key: HBASE-4695
> URL: https://issues.apache.org/jira/browse/HBASE-4695
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 0.90.4
> Reporter: jack levin
> Assignee: gaojinchao
> Priority: Blocker
> Fix For: 0.90.5
>
> Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt