Posted to issues@hbase.apache.org by "nkeywal (Created) (JIRA)" <ji...@apache.org> on 2012/01/10 01:43:45 UTC

[jira] [Created] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")
-------------------------------------------------------------------------------------------------------------------------------------

                 Key: HBASE-5163
                 URL: https://issues.apache.org/jira/browse/HBASE-5163
             Project: HBase
          Issue Type: Bug
          Components: test
    Affects Versions: 0.94.0
         Environment: all
            Reporter: nkeywal
            Assignee: nkeywal
            Priority: Minor


The stack is typically:
{noformat}
    <error message="Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked." type="java.io.IOException">java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
	at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
        // ...
{noformat}

It can be reproduced without parallelization and without executing the other tests in the class; it fails roughly 5% of the time.

This comes from the directory naming policy in MiniDFSCluster#startDataNode: the names depend on the number of nodes *currently* in the cluster and do not take previous starts/stops into account:
{noformat}
   for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
      if (manageDfsDirs) {
        File dir1 = new File(data_dir, "data"+(2*i+1));
        File dir2 = new File(data_dir, "data"+(2*i+2));
        dir1.mkdirs();
        dir2.mkdirs();
      // [...]
{noformat}

This means that if we want to stop and restart a datanode, we must always stop the last one; otherwise the directory names will conflict. The following test exhibits the behavior:
{noformat}
  @Test
  public void testMiniDFSCluster_startDataNode() throws Exception {

    assertTrue( dfsCluster.getDataNodes().size() == 2 );


    // Works, as we kill the last datanode, we can now start a datanode
    dfsCluster.stopDataNode(1);
    dfsCluster
      .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);


    // Fails, as it's not the last datanode, the directory will conflict on
    //  creation
    dfsCluster.stopDataNode(0);
    try {
      dfsCluster
        .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
      fail("There should be an exception because the directory already exists");
    } catch (IOException e) {
      assertTrue( e.getMessage().contains("The directory is already locked."));
      LOG.info("Expected (!) exception caught " + e.getMessage());
    }

    // Works, as we kill the last datanode, we can now restart 2 datanodes
    // This makes us back with 2 nodes
    dfsCluster.stopDataNode(0);
    dfsCluster
      .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
  }
{noformat}


This behavior is then triggered randomly in testLogRollOnDatanodeDeath: when we do
{noformat}
DatanodeInfo[] pipeline = getPipeline(log);
assertTrue(pipeline.length == fs.getDefaultReplication());
{noformat}

and then kill the datanodes in the pipeline, there are two cases:
 - most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a new datanode that reuses datanode 2's now-free directories.
 - sometimes: pipeline = 1 & 3. In this case, launching the new datanode fails because it tries to use the same directories as the still-alive datanode 2 (see the walkthrough below).
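
To make the failing case concrete, here is the directory arithmetic for pipeline = 1 & 3, following the loop quoted above (datanodes numbered 1..3 as in the bullets):
{noformat}
Start: datanodes 1, 2, 3 own data1/data2, data3/data4, data5/data6
Pipeline = 1 & 3 -> kill datanodes 1 and 3; datanode 2 stays alive
curDatanodesNum is now 1, so the replacement node gets i = 1:
  dir1 = "data" + (2*1+1) = data3,  dir2 = "data" + (2*1+2) = data4
Datanode 2 still holds the lock on data3/data4
  -> java.io.IOException: ... The directory is already locked.
{noformat}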

There are two ways of fixing the test:
1) Fix the naming rule in MiniDFSCluster#startDataNode, for example so that directory names are never reused. But I wonder whether some test case somewhere (maybe not in HBase) depends on the current behavior.
2) Explicitly kill the first and second datanodes, without consulting the pipeline, so that the names cannot conflict (see the sketch below).
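
For illustration, a minimal sketch of what option 2 could look like; this is hypothetical, not the committed patch, and it assumes the fixtures visible in the test above (dfsCluster, TEST_UTIL) and a cluster started with 3 datanodes (directories data1..data6):
{noformat}
// Kill the first two datanodes regardless of what the WAL pipeline
// reports, so the replacement node always lands on free directories.
dfsCluster.stopDataNode(0);   // datanode 1: releases data1/data2
dfsCluster.stopDataNode(0);   // datanode 2, now at index 0: releases data3/data4
// Only datanode 3 is left, so curDatanodesNum == 1 and the new node
// gets i = 1, i.e. data3/data4 -- just freed, hence no lock conflict.
dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
{noformat}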

Feedback welcome on the choice to make here...



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184366#comment-13184366 ] 

Hudson commented on HBASE-5163:
-------------------------------

Integrated in HBase-TRUNK-security #73 (See [https://builds.apache.org/job/HBase-TRUNK-security/73/])
    HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.") (N Keywal)

tedyu : 
Files : 
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java

                

[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)
    

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184113#comment-13184113 ] 

Zhihong Yu commented on HBASE-5163:
-----------------------------------

Integrated to TRUNK.

Thanks for the patch, N.

Thanks for the review, Stack.
                

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184741#comment-13184741 ] 

Hudson commented on HBASE-5163:
-------------------------------

Integrated in HBase-0.92 #241 (See [https://builds.apache.org/job/HBase-0.92/241/])
    HBASE-5163  TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA
               ("The directory is already locked.") (N Keywal)

tedyu : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java

                

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

Posted by "nkeywal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183655#comment-13183655 ] 

nkeywal commented on HBASE-5163:
--------------------------------

I've got a fix using a variation of #2. Tested 100 times without any failure.
The advantage of #1, for me, is that it eliminates a fairly tricky behavior, but the fix would then live outside HBase...

                

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183682#comment-13183682 ] 

Zhihong Yu commented on HBASE-5163:
-----------------------------------

+1 on patch.
Looped over the test and didn't encounter any failure.
                

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184701#comment-13184701 ] 

Hadoop QA commented on HBASE-5163:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510307/5163-92.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/742//console

This message is automatically generated.
                

[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------

    Summary: TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")  (was: TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked."))
    

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183487#comment-13183487 ] 

stack commented on HBASE-5163:
------------------------------

The lock issue happens in a bunch of tests IIRC.  Nice digging N.  Can #2 fail ever?  If so, I like #1.
                

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184147#comment-13184147 ] 

Hudson commented on HBASE-5163:
-------------------------------

Integrated in HBase-TRUNK #2618 (See [https://builds.apache.org/job/HBase-TRUNK/2618/])
    HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.") (N Keywal)

tedyu : 
Files : 
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java

                

[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5163:
---------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5163:
---------------------------

    Attachment: 5163.patch
    

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182965#comment-13182965 ] 

Zhihong Yu commented on HBASE-5163:
-----------------------------------

Nice analysis.
I like #2 better. We should always make the test scenario deterministic.
                

[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------

    Attachment: 5163-92.txt

Patch I would integrate into 0.92.
                

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184788#comment-13184788 ] 

Hudson commented on HBASE-5163:
-------------------------------

Integrated in HBase-0.92-security #72 (See [https://builds.apache.org/job/HBase-0.92-security/72/])
    HBASE-5163  TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA
               ("The directory is already locked.") (N Keywal)

tedyu : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java

                

[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------

    Affects Version/s:     (was: 0.94.0)
                       0.92.0
        Fix Version/s: 0.94.0
                       0.92.0
         Hadoop Flags: Reviewed
    

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183908#comment-13183908 ] 

stack commented on HBASE-5163:
------------------------------

This is kinda funny:

{code}
-    Thread.sleep(10000);
+
{code}

given the comment you add just above it: "+    // This function is synchronous, when it returns the node is killed."

Good stuff.

(Woah, you get rid of two ten-second pauses...)

+1 on commit.
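
For context, the resulting pattern in the test is roughly the following (a sketch; the quoted diff fragments confirm only the removed Thread.sleep(10000) and the added comment, the surrounding calls are assumptions):

{code}
// Before (sketch): stop a datanode, then sleep and hope it is down.
dfsCluster.stopDataNode(0);
Thread.sleep(10000);

// After (sketch): no pause needed.
// This function is synchronous, when it returns the node is killed.
dfsCluster.stopDataNode(0);
{code}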
                

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183691#comment-13183691 ] 

Hadoop QA commented on HBASE-5163:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510114/5163.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated -147 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                      org.apache.hadoop.hbase.replication.TestReplication
                      org.apache.hadoop.hbase.mapreduce.TestImportTsv
                      org.apache.hadoop.hbase.mapred.TestTableMapReduce
                      org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/723//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/723//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/723//console

This message is automatically generated.
                