Posted to issues@hbase.apache.org by "jack levin (Created) (JIRA)" <ji...@apache.org> on 2011/10/28 22:29:32 UTC

[jira] [Created] (HBASE-4695) WAL logs get deleted before region server can fully flush

WAL logs get deleted before region server can fully flush
---------------------------------------------------------

Key: HBASE-4695
URL: https://issues.apache.org/jira/browse/HBASE-4695
Project: HBase
Issue Type: Bug
Components: wal
Affects Versions: 0.90.4
Reporter: jack levin
Priority: Blocker
Fix For: 0.90.5

To replicate the problem do the following:

1. Check the /hbase/.logs/XXXX directory to confirm that WAL logs exist for the region server you are about to shut down.
2. Run kill <pid> (where pid is the regionserver pid).
3. Watch the regionserver log as it starts flushing; it reports how many regions are still waiting to close:

09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 489 regions to close
09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 116 regions to close

4. Check /hbase/.logs/XXXX again -- you will notice that it has disappeared.
5. Check namenode logs:

09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root ip=/10.101.1.5 cmd=delete src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749

Note that if you kill -9 the RS at this point, or it crashes during the flush, you won't have any WAL logs to replay. We need to make sure that logs are deleted or moved out only after the RS has fully flushed. Otherwise it's possible to lose data.
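
The ordering requirement described above can be illustrated with a small model. This is only a sketch in Python under stated assumptions, not HBase code: RegionServerModel, flush_and_close, archive_wal and data_at_risk are hypothetical names made up for this illustration, and the model ignores everything except the order of the two steps.

```python
class RegionServerModel:
    """Toy model of a region server shutdown (hypothetical, not HBase code)."""

    def __init__(self, regions):
        # Regions whose in-memory edits so far exist only in the WAL.
        self.unflushed = set(regions)
        self.wal_present = True
        self.events = []

    def flush_and_close(self, region):
        # Once a region is flushed, the WAL is no longer needed to recover it.
        self.unflushed.discard(region)
        self.events.append(("closed", region))

    def archive_wal(self):
        # Safe ordering: refuse to remove the WAL while any region is unflushed.
        if self.unflushed:
            raise RuntimeError(
                "%d regions still unflushed; WAL must stay" % len(self.unflushed))
        self.wal_present = False
        self.events.append(("wal_archived", None))

    def shutdown(self):
        # Flush and close every region first, and only then archive the WAL.
        for region in sorted(self.unflushed):
            self.flush_and_close(region)
        self.archive_wal()


def data_at_risk(server):
    """True if a crash right now would lose edits: unflushed regions and no WAL."""
    return bool(server.unflushed) and not server.wal_present
```

In this model the situation from the report corresponds to data_at_risk returning True, which the guard in archive_wal is meant to make unreachable.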

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "gaojinchao (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao reassigned HBASE-4695:
---------------------------------

    Assignee: gaojinchao
    
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5

[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "Ted Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4695:
--------------------------

    Attachment: hbase-4695-0.92.txt

Proposed patch for 0.92
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140299#comment-13140299 ] 

stack commented on HBASE-4695:
------------------------------

@Jack You want to try this patch?
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "jack levin (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140713#comment-13140713 ] 

jack levin commented on HBASE-4695:
-----------------------------------

It would be nice to have a patch for 0.90.4 also.


Thanks,

-Jack


                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139986#comment-13139986 ] 

gaojinchao commented on HBASE-4695:
-----------------------------------

Tested the latest trunk version; the test passed on a real cluster:

Region Server logs:
2011-10-31 03:32:42,922 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server C3S31,20020,1320034091400
2011-10-31 03:32:46,974 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server C3S31,20020,1320034091400; all regions closed.
2011-10-31 03:32:48,633 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Moved 7 log files to /hbase/.oldlogs
2011-10-31 03:32:49,200 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server C3S31,20020,1320034091400; zookeeper connection closed.

Namenode logs:
2011-10-31 03:32:46,988 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(192)) - ugi=root,root,sfcb	ip=/158.1.130.31	cmd=listStatus	src=/hbase/.logs/C3S31,20020,1320034091400	perm=root:supergroup:rwxr-xr-x
2011-10-31 03:32:46,991 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb	ip=/158.1.130.31	cmd=rename	src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320045179340	dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320045179340	perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,992 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb	ip=/158.1.130.31	cmd=rename	src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046155808	dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046155808	perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,994 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb	ip=/158.1.130.31	cmd=rename	src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046186294	dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046186294	perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,996 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb	ip=/158.1.130.31	cmd=rename	src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046216288	dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046216288	perm=root:supergroup:rw-r--r--
2011-10-31 03:32:46,998 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb	ip=/158.1.130.31	cmd=rename	src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046255166	dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046255166	perm=root:supergroup:rw-r--r--
2011-10-31 03:32:47,206 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(192)) - ugi=webuser,webgroup	ip=/158.1.130.33	cmd=listStatus	src=/hbase/.logs/C3S31,20020,1320034091400	perm=root:supergroup:rwxr-xr-x
2011-10-31 03:32:48,518 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb	ip=/158.1.130.31	cmd=rename	src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046295501	dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046295501	perm=root:supergroup:rw-r--r--
2011-10-31 03:32:48,633 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(177)) - ugi=root,root,sfcb	ip=/158.1.130.31	cmd=rename	src=/hbase/.logs/C3S31,20020,1320034091400/C3S31%2C20020%2C1320034091400.1320046325013	dst=/hbase/.oldlogs/C3S31%2C20020%2C1320034091400.1320046325013	perm=root:supergroup:rw-r--r--
2011-10-31 03:32:48,650 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(206)) - ugi=root,root,sfcb	ip=/158.1.130.31	cmd=delete	src=/hbase/.logs/C3S31,20020,1320034091400	
2011-10-31 03:32:49,389 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(206)) - ugi=root,root,sfcb	ip=/158.1.130.32	cmd=delete	src=/hbase/.META./1028785192/.tmp	
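
A check like the one done by eye on the logs above can be sketched in Python. This is a rough illustration under stated assumptions, not a supported tool: the function names are hypothetical, the regular expressions are derived only from the log lines quoted in this message, lines without a full date-and-time prefix are skipped, and the region server and namenode clocks are assumed to be in sync.

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")
AUDIT_RE = re.compile(r"cmd=(\w+)\s+src=(\S+)")


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp, or return None."""
    m = TS_RE.match(line)
    return datetime.strptime(m.group(1), TS_FORMAT) if m else None


def all_regions_closed_at(rs_lines):
    """Timestamp of the first region server line reporting 'all regions closed'."""
    for line in rs_lines:
        if "all regions closed" in line:
            return parse_ts(line)
    return None


def early_wal_operations(rs_lines, audit_lines, wal_root="/hbase/.logs/"):
    """Return (timestamp, cmd, src) for renames/deletes under wal_root that
    were audited before the region server logged 'all regions closed'.
    If that line is missing, every such operation is reported."""
    closed_at = all_regions_closed_at(rs_lines)
    early = []
    for line in audit_lines:
        ts = parse_ts(line)
        m = AUDIT_RE.search(line)
        if ts is None or m is None:
            continue
        cmd, src = m.groups()
        if cmd in ("rename", "delete") and src.startswith(wal_root):
            if closed_at is None or ts < closed_at:
                early.append((ts, cmd, src))
    return early
```

For the excerpts above, the first rename is audited at 03:32:46,991, after "all regions closed" at 03:32:46,974, so an empty list would be the expected result for this run.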


                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4695:
-------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Applied to branches and trunk.  Thanks for the patch, Gao.

I looked at this for a while to see if I could make a test.  It would be a timing-based thing where we'd check that logs were not moved before the flush had completed.  Punting.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140960#comment-13140960 ] 

stack commented on HBASE-4695:
------------------------------

@Jack Thanks boss for taking it for a spin.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "gaojinchao (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4695:
------------------------------

    Attachment: HBASE-4695_Trunk_V2.patch
    
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141754#comment-13141754 ] 

Hudson commented on HBASE-4695:
-------------------------------

Integrated in HBase-TRUNK #2397 (See [https://builds.apache.org/job/HBase-TRUNK/2397/])
    HBASE-4695  WAL logs get deleted before region server can fully flush

stack : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java

                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139548#comment-13139548 ] 

Hadoop QA commented on HBASE-4695:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12501494/hbase-4695-0.92.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -166 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.master.TestDistributedLogSplitting

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/103//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/103//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/103//console

This message is automatically generated.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141822#comment-13141822 ] 

Hudson commented on HBASE-4695:
-------------------------------

Integrated in HBase-0.92 #95 (See [https://builds.apache.org/job/HBase-0.92/95/])
    HBASE-4695  WAL logs get deleted before region server can fully flush

stack : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java

                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "gaojinchao (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4695:
------------------------------

    Attachment: HBASE-4695_branch90_trial.patch

I will verify this patch once I am back at the office.
If you have time, please review it first.

The patch seems simple.

                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_branch90_trial.patch

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "jack levin (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140918#comment-13140918 ] 

jack levin commented on HBASE-4695:
-----------------------------------

I confirmed that this fix works on my cluster: when I shut down
the region server, the logs are moved to .oldlogs only after the RS
has fully shut down.

-Jack
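
A check like this could also be scripted against the timestamps that the region server log and the namenode audit log print. The sketch below is only an illustration under stated assumptions: `WalDeleteOrderCheck` and `deletedBeforeRegionsClosed` are hypothetical names, both timestamps are assumed to fall on the same day, and the clocks of the two hosts are assumed to be in sync.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper; not part of HBase or of the attached patches. */
final class WalDeleteOrderCheck {

  // Time-of-day format used by the log lines in the issue description, e.g. "09:26:41,607".
  private static final DateTimeFormatter LOG_TIME =
      DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

  /**
   * Returns true when the WAL directory was removed while the region server
   * was still reporting "Waiting on N regions to close" (the buggy ordering).
   */
  static boolean deletedBeforeRegionsClosed(String walDeleteTime,
                                            String lastWaitingOnRegionsTime) {
    LocalTime deleted = LocalTime.parse(walDeleteTime, LOG_TIME);
    LocalTime stillClosing = LocalTime.parse(lastWaitingOnRegionsTime, LOG_TIME);
    return deleted.isBefore(stillClosing);
  }

  public static void main(String[] args) {
    // Timestamps taken from the issue description (audit-log delete vs. last "Waiting on" line).
    System.out.println(deletedBeforeRegionsClosed("09:26:41,607", "09:56:35,779"));
  }
}
```

For the timestamps in the issue description the method returns true, which is the ordering the fix is meant to prevent.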

On Mon, Oct 31, 2011 at 7:20 PM, Hadoop QA (Commented) (JIRA)

                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4695:
-------------------------

    Fix Version/s: 0.92.0
    
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140009#comment-13140009 ] 

gaojinchao commented on HBASE-4695:
-----------------------------------

@Ted
Our lab network is down; I will verify the newest patch tomorrow.

                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140297#comment-13140297 ] 

stack commented on HBASE-4695:
------------------------------

+1 on patch.  Looks good to me.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139532#comment-13139532 ] 

Ted Yu commented on HBASE-4695:
-------------------------------

Originally closeWAL(true) would be called even if this.fsOk is false. But now:
{code}
+    if (this.fsOk) {
       waitOnAllRegionsToClose(abortRequested);
+      if (!this.killed){
+        closeWAL(abortRequested ? false : true);
{code}
I think the call to closeWAL() should be placed outside the if (this.fsOk) block.
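
For illustration only, the two placements under discussion can be modelled with a small stand-alone Java sketch. It is not HBase code: the class `ShutdownOrderSketch`, the method `shutdownSteps`, and the `closeWalInsideFsOk` switch are hypothetical names, and the sketch assumes the `!this.killed` guard is kept in both variants. It only records which steps would run, and in what order.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the shutdown sequence; hypothetical, not taken from HRegionServer. */
final class ShutdownOrderSketch {

  /**
   * Returns the steps a modelled shutdown would run, in order.
   *
   * @param closeWalInsideFsOk true models closeWAL() nested in the fsOk block
   *        (as in the quoted diff); false models calling it outside that block
   */
  static List<String> shutdownSteps(boolean fsOk, boolean killed,
                                    boolean abortRequested, boolean closeWalInsideFsOk) {
    List<String> steps = new ArrayList<>();
    if (fsOk) {
      // Regions are only waited on when the filesystem is usable.
      steps.add("waitOnAllRegionsToClose");
    }
    boolean closeWalReachable = closeWalInsideFsOk ? fsOk : true;
    if (closeWalReachable && !killed) {
      // Same value as the diff's "abortRequested ? false : true".
      steps.add("closeWAL(" + !abortRequested + ")");
    }
    return steps;
  }

  public static void main(String[] args) {
    System.out.println(shutdownSteps(true, false, false, true));
    System.out.println(shutdownSteps(false, false, false, true));
    System.out.println(shutdownSteps(false, false, false, false));
  }
}
```

With a healthy filesystem both variants yield the same sequence. They differ only when fsOk is false: the nested placement skips closeWAL() entirely, while the outer placement still calls it.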

The next step is to verify that the issue is really fixed.

Thanks for taking care of this, Jinchao.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_branch90_trial.patch

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139987#comment-13139987 ] 

gaojinchao commented on HBASE-4695:
-----------------------------------

@Ted
I think your patch has a problem: when this.fsOk is false, we would still clean up the HLog, and data would be lost.
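
This concern can be phrased as a guard on WAL deletion. The following is a minimal sketch, assuming a hypothetical helper `safeToDeleteWal` and a hypothetical count of regions that are still closing; neither name comes from the patch.

```java
/** Hypothetical guard; illustrates the invariant only, not the committed fix. */
final class WalDeletionGuard {

  /**
   * The WAL may be deleted only when nothing could still need replaying:
   * the filesystem is usable, the server was not killed or aborted,
   * and every region has finished closing (and therefore flushing).
   */
  static boolean safeToDeleteWal(boolean fsOk, boolean killed,
                                 boolean abortRequested, int regionsStillClosing) {
    return fsOk && !killed && !abortRequested && regionsStillClosing == 0;
  }

  public static void main(String[] args) {
    System.out.println(safeToDeleteWal(true, false, false, 0));
    System.out.println(safeToDeleteWal(false, false, false, 0));
    System.out.println(safeToDeleteWal(true, false, false, 116));
  }
}
```

Under this rule a failed filesystem check, or any region still waiting to close, keeps the logs in place so they remain available for replay.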

                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139553#comment-13139553 ] 

gaojinchao commented on HBASE-4695:
-----------------------------------

@Ted
Thanks for your review.
My reasoning was that if this.fsOk is false, calling closeWAL() makes no sense.
Your approach is the same as the original logic, which is fine. I will verify it next Monday.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140847#comment-13140847 ] 

gaojinchao commented on HBASE-4695:
-----------------------------------

@jack
I made a patch for branch90. It contains a small change that you can merge into your version.
Our lab network is also down, so I can't verify it myself.
Sorry.

                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139630#comment-13139630 ] 

Ted Yu commented on HBASE-4695:
-------------------------------

The test failures reported by Hadoop QA were due to 'Too many open files'.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140044#comment-13140044 ] 

Hadoop QA commented on HBASE-4695:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12501582/HBASE-4695_Trunk_V2.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc.  The javadoc tool appears to have generated -166 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.master.TestDistributedLogSplitting
                       org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
                       org.apache.hadoop.hbase.master.TestMasterFailover

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/110//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/110//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/110//console

This message is automatically generated.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "gaojinchao (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4695:
------------------------------

    Attachment: HBASE-4695_Branch90_V2.patch
    
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt

[jira] [Commented] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140852#comment-13140852 ] 

Hadoop QA commented on HBASE-4695:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12501720/HBASE-4695_Branch90_V2.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/116//console

This message is automatically generated.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>         Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>


[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

Posted by "Ted Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4695:
--------------------------

    Status: Patch Available  (was: Open)

Submit for patch testing.
                
> WAL logs get deleted before region server can fully flush
> ---------------------------------------------------------
>
>                 Key: HBASE-4695
>                 URL: https://issues.apache.org/jira/browse/HBASE-4695
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.90.4
>            Reporter: jack levin
>            Assignee: gaojinchao
>            Priority: Blocker
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt
>
>
