Posted to dev@hama.apache.org by "Edward J. Yoon (JIRA)" <ji...@apache.org> on 2011/05/13 03:07:47 UTC

[jira] [Created] (HAMA-387) Add task ID and superstep count informations to lock file

Add task ID and superstep count informations to lock file
---------------------------------------------------------

                 Key: HAMA-387
                 URL: https://issues.apache.org/jira/browse/HAMA-387
             Project: Hama
          Issue Type: Improvement
          Components: bsp
    Affects Versions: 0.2.0
            Reporter: Edward J. Yoon
             Fix For: 0.3.0


I think the lock file must include:

 * the job ID
 * the task ID of the lock file owner
 * the current superstep count

to check ownership and validity.

Currently lock files are named by hostname, but multiple tasks may run per groomserver in the future.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113900#comment-13113900 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

Well, in our groom code the enter and leave barrier methods are just two RPC calls. This is cleaner than the whole sync and notify of ZK nodes.
In addition we have our own sync service, which can now keep track of the superstep and any additional information we want to keep there, for example which tasks are currently within the barrier. So we don't need ZooKeeper at all.
We could also de-register a task, so that we can adjust the number of tasks needed to trip the barrier at runtime. So we could add another method, some kind of waitToHalt(), which deregisters the task from the sync service.

Besides that, I think this is faster than ZK barrier sync.
So to summarize, we would have full control: it is our code, with no dependency. It is cleaner and we can implement new features more easily with it.

And I guess I will take the sync code for the MR NG integration, just because it is its own service and I don't want to debug the BSPPeer barrier code.
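
As a rough illustration of the idea above (this is not the code from ownSyncService.patch, which is not shown in this thread), a barrier that tracks the superstep and lets a task deregister at runtime can be sketched in-process with java.util.concurrent.Phaser. The class name SyncServiceSketch and all method names except waitToHalt() are hypothetical, and a real service would expose these calls over RPC.

```java
import java.util.concurrent.Phaser;

/** Hypothetical in-process stand-in for a sync service; a real one would sit behind RPC. */
public class SyncServiceSketch {
  // The Phaser's phase number plays the role of the superstep counter.
  private final Phaser phaser = new Phaser();

  /** Registers one task as a barrier participant. */
  public void register() {
    phaser.register();
  }

  /** Blocks until every registered task has arrived at the barrier. */
  public void sync() {
    phaser.arriveAndAwaitAdvance();
  }

  /** Deregisters the calling task, so the barrier trips with one party fewer. */
  public void waitToHalt() {
    phaser.arriveAndDeregister();
  }

  public int getSuperstep() {
    return phaser.getPhase();
  }

  public int getRegisteredTasks() {
    return phaser.getRegisteredParties();
  }

  public static void main(String[] args) throws InterruptedException {
    SyncServiceSketch service = new SyncServiceSketch();
    int tasks = 3;
    for (int i = 0; i < tasks; i++) {
      service.register();
    }
    Thread[] threads = new Thread[tasks];
    for (int i = 0; i < tasks; i++) {
      final int id = i;
      threads[i] = new Thread(() -> {
        service.sync();           // all three tasks meet at the first barrier
        if (id == 0) {
          service.waitToHalt();   // task 0 leaves; two parties remain
        } else {
          service.sync();         // the remaining two tasks trip the next barrier
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("superstep=" + service.getSuperstep()
        + " tasks=" + service.getRegisteredTasks());
  }
}
```

The point of the sketch is only that the service itself owns the superstep counter and the party count, so neither has to be reconstructed from lock files.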

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107568#comment-13107568 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

Oh, the patch looks good, ChiaHung!

I'm going to test this patch now. 

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065764#comment-13065764 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

HAMA-387_v04.patch is still problematic.

It hung when the job was started.

{code}
root@hnode01:/usr/local/src/hama-trunk# bin/hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar bench 2 2 200
11/07/15 17:00:05 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://hnode15:9000/tmp/hadoop-root/bsp/system/submit_157sln
11/07/15 17:00:06 INFO bsp.BSPJobClient: Running job: job_201107151659_0001
11/07/15 17:00:09 INFO bsp.BSPJobClient: Current supersteps number: 0

----
2011-07-15 17:32:25,637 INFO org.apache.hama.bsp.TaskRunner: attempt_201107151659_0001_000011_0 11/07/15 17:32:25 DEBUG bsp.BSPPeer: hnode08:61000 is in superstep 0
2011-07-15 17:32:25,638 INFO org.apache.hama.bsp.TaskRunner: attempt_201107151659_0001_000011_0 11/07/15 17:32:25 DEBUG bsp.BSPPeer: hnode15:61000 is in superstep 46
2011-07-15 17:32:25,639 INFO org.apache.hama.bsp.TaskRunner: attempt_201107151659_0001_000011_0 11/07/15 17:32:25 DEBUG bsp.BSPPeer: hnode06:61000 is in superstep 0
2011-07-15 17:32:25,640 INFO org.apache.hama.bsp.TaskRunner: attempt_201107151659_0001_000011_0 11/07/15 17:32:25 DEBUG bsp.BSPPeer: hnode10:61000 is in superstep 0
2011-07-15 17:32:25,641 INFO org.apache.hama.bsp.TaskRunner: attempt_201107151659_0001_000011_0 11/07/15 17:32:25 DEBUG bsp.BSPPeer: hnode16:61000 is in superstep 0
2011-07-15 17:32:25,642 INFO org.apache.hama.bsp.TaskRunner: attempt_201107151659_0001_000011_0 11/07/15 17:32:25 DEBUG bsp.BSPPeer: [cnode05:61000] enter the enterbarrier: 0
{code}

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037787#comment-13037787 ] 

Thomas Jungblut edited comment on HAMA-387 at 5/23/11 5:27 PM:
---------------------------------------------------------------

Ah I see :) 
Why don't we pick a bspRoot that represents the jobID + superstep?
The layout could then be:

bspRoot + "/" + jobID + "/" + superstep + "/" + groom
EDIT taskid is not needed in the path...

Or is this violating the zookeeper?

EDIT2: this won't work, because zookeeper could not find the peers then, but it was a good idea.

      was (Author: thomas.jungblut):
    Ah I see :) 
Why don't we pick a bspRoot that represents the jobID + superstep.
The layout could be then:

bspRoot + "/" + jobID + "/" + superstep + "/" + groom
EDIT taskid is not needed in the path...

Or is this violating the zookeeper?
  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Updated] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon updated HAMA-387:
--------------------------------

    Attachment: HAMA-387_v02.patch

My solution is also simple.

By checking the superstep number in leaveBarrier(), the problem above can be fixed. (But I haven't tested it yet.)

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: HAMA-387_v02.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065785#comment-13065785 ] 

Edward J. Yoon edited comment on HAMA-387 at 7/15/11 9:02 AM:
--------------------------------------------------------------

I think we should change the sync mechanism fundamentally.

As we can see in the diagram [1], some tasks can be slower than others.

If a user has a large cluster, "zk.getChildren()" will be called by a huge number of peers.

{code}
    while (true) {
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
{code}

1. https://issues.apache.org/jira/secure/attachment/12486573/x.PNG
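
To make the scaling concern concrete, here is a back-of-the-envelope model, not a measurement. It assumes that peers delete their lock files one at a time, that each deletion wakes every peer already waiting in leaveBarrier(), and that each woken peer calls getChildren() once more. The class and method names are hypothetical.

```java
public class GetChildrenCostSketch {
  /**
   * Counts getChildren() calls in a simplified model: peers delete their lock
   * file one at a time, each deletion wakes every peer that is already waiting,
   * and each woken peer re-reads the children list once.
   */
  public static long countCalls(int peers) {
    long calls = 0;
    int waiting = 0;
    for (int i = 0; i < peers; i++) {
      calls += waiting; // the deletion wakes all peers that are already waiting
      calls += 1;       // the deleting peer makes its own first call
      waiting++;
    }
    return calls;
  }

  public static void main(String[] args) {
    for (int n : new int[] {10, 100, 1000}) {
      System.out.println(n + " peers -> " + countCalls(n) + " calls");
    }
  }
}
```

Under these assumptions the count is n(n+1)/2 per barrier, so it grows quadratically with the number of peers.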

      was (Author: udanax):
    I think we should change the sync mechanism fundamentally.

As we can see the diagram, some task can be slower than others.

If user have a large cluster, "zk.getChildren()" will be called by huge number of peers.

{code}
    while (true) {
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
{code}
  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037785#comment-13037785 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

The reason is very simple: before the last peer has left the barrier, other peers are already starting to create their lock files.

{code}
    zk.delete(bspRoot + "/" + getPeerName(), 0);

    while (true) {
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
        if (list.size() > 0) {
          mutex.wait();
        } else {
          LOG.debug("[" + getPeerName() + "] leave from the leaveBarrier");
          return true;
        }
      }
    }
{code}

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064288#comment-13064288 ] 

Hudson commented on HAMA-387:
-----------------------------

Integrated in Hama-Patch #345 (See [https://builds.apache.org/job/Hama-Patch/345/])
    Committing temporary solution of HAMA-387

edwardyoon : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1137857
Files : 
* /incubator/hama/trunk/src/java/org/apache/hama/bsp/BSPPeer.java


> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, new.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Assigned] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon reassigned HAMA-387:
-----------------------------------

    Assignee: Edward J. Yoon

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: HAMA-387_v02.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113852#comment-13113852 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

Thomas,

Can you explain the advantages and disadvantages of your patch?

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Issue Comment Edited] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113900#comment-13113900 ] 

Thomas Jungblut edited comment on HAMA-387 at 9/24/11 6:00 AM:
---------------------------------------------------------------

Well, in our bsppeer code the enter and leave barrier methods are just two RPC calls. This is cleaner than the whole sync and notify of ZK nodes.
In addition we have our own sync service, which can now keep track of the superstep and any additional information we want to keep there, for example which tasks are currently within the barrier. So we don't need ZooKeeper at all.
We could also de-register a task, so that we can adjust the number of tasks needed to trip the barrier at runtime. So we could add another method, some kind of waitToHalt(), which deregisters the task from the sync service.

Besides that, I think this is faster than ZK barrier sync.
So to summarize, we would have full control: it is our code, with no dependency. It is cleaner and we can implement new features more easily with it.

And I guess I will take the sync code for the MR NG integration, just because it is its own service and I don't want to debug the BSPPeer barrier code.

      was (Author: thomas.jungblut):
    Well, in our groom code the enter and leave barrier methods are just two RPC calls. This is cleaner than the whole sync and notify of ZK Nodes.
In addition we have our own sync service, which can now keep track of Superstep and additional information if we want to keep it there. For example which tasks are currently within the barrier. So we don't need zookeeper at all.
And we possibly could de-register task, so we can adjust the number of tasks that are need to trip the barrier during runtime. So we could add another method which is some kind of waitToHalt(), which deregisters the task from the sync service.

Besides that, I think this is faster than ZK barrier sync.
So to summarize, we would have full control, it is our code, no dependency. It is cleaner and we can implement new features easier with it.

And I guess I take the sync code for the MR NG integration, just because it is its own service and I don't want to debug the BSPPeer barrier code.
  
> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037787#comment-13037787 ] 

Thomas Jungblut edited comment on HAMA-387 at 5/23/11 8:19 AM:
---------------------------------------------------------------

Ah I see :) 
Why don't we pick a bspRoot that represents the jobID + superstep?
The layout could then be:

bspRoot + "/" + jobID + "/" + superstep
EDIT taskid is not needed in the path...

Or is this violating the zookeeper?

      was (Author: thomas.jungblut):
    Ah I see :) 
Why don't we pick a bspRoot that represents the jobID + superstep.
The layout could be then:

bspRoot + "/" + jobID + "_" + taskID  + "/" + superstep

Or is this violating the zookeeper?
  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Updated] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon updated HAMA-387:
--------------------------------

    Attachment: x.PNG

I think we should change the sync mechanism fundamentally.

As we can see in the diagram, some tasks can be slower than others.

If a user has a large cluster, "zk.getChildren()" will be called by a huge number of peers.

{code}
    while (true) {
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
{code}

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038473#comment-13038473 ] 

Thomas Jungblut edited comment on HAMA-387 at 5/24/11 10:16 AM:
----------------------------------------------------------------

Won't work. 

{noformat}
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/1_thomas-desktop:56492
{noformat}

ZooKeeper sucks?:D

EDIT:
We actually have to store the superstep count in the data bytes of this lock node. Then we have to fetch the data and deserialize it to check which superstep the node is in...

{noformat}

private int countGroomsInSuperStep(List<String> list, long superStep)
    throws KeeperException, InterruptedException {
  int count = 0;
  for (String groom : list) {
    // The znode's data holds the superstep count of the groom that owns it.
    byte[] data = zk.getData(bspRoot + "/" + groom, null, null);
    if (Bytes.toLong(data) == superStep) {
      count++;
    }
  }
  return count;
}

{noformat}

The loop then looks like this:
{noformat}
    while (true) {
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
        if (countGroomsInSuperStep(list, this.getSuperstepCount()) > 0) {
          mutex.wait();
        } else {
          LOG.debug("[" + getPeerName() + "] leave from the leaveBarrier");
          return true;
        }
      }
    }
{noformat}
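
The snippet above relies on a Bytes.toLong helper whose origin is not shown in this thread. As a minimal sketch of the encoding side, the superstep count could be written to and read from a znode's data with java.nio.ByteBuffer. The class name SuperstepCodec is hypothetical, and the 8-byte big-endian layout is an assumption rather than the format used by any of the attached patches.

```java
import java.nio.ByteBuffer;

public class SuperstepCodec {
  /** Encodes the superstep count as 8 big-endian bytes, suitable as znode data. */
  public static byte[] toBytes(long superstep) {
    return ByteBuffer.allocate(Long.BYTES).putLong(superstep).array();
  }

  /** Decodes a superstep count previously written by toBytes. */
  public static long toLong(byte[] data) {
    if (data == null || data.length != Long.BYTES) {
      throw new IllegalArgumentException(
          "expected " + Long.BYTES + " bytes of superstep data");
    }
    return ByteBuffer.wrap(data).getLong();
  }

  public static void main(String[] args) {
    byte[] data = toBytes(46L);
    System.out.println(toLong(data)); // prints 46
  }
}
```

The length check matters here because a lock node created without data would otherwise fail with a less helpful buffer underflow.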

      was (Author: thomas.jungblut):
    Won't work. 

{noformat}
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/1_thomas-desktop:56492
{noformat}

ZooKeeper sucks?:D

EDIT:
We actually have to set the superstep count into the byte value of this lock. Then we have to get the object and deserialize it then to check in which superstep the node is...

This is really crappy, we should open a feature ticket on the ZK project.
  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


Re: [jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon" <ed...@apache.org>.
Haha yes.

+1

Sent from my iPhone

On 2011. 7. 15., at 6:10 PM, "Thomas Jungblut (JIRA)" <ji...@apache.org> wrote:

> 
>    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065791#comment-13065791 ] 
> 
> Thomas Jungblut commented on HAMA-387:
> --------------------------------------
> 
> You just took the code from zookeepers barrier sync example right?:D
> 
> I would prefer a self-written solution as stated in the first post..
> 
>> Add task ID and superstep count informations to lock file
>> ---------------------------------------------------------
>> 
>>                Key: HAMA-387
>>                URL: https://issues.apache.org/jira/browse/HAMA-387
>>            Project: Hama
>>         Issue Type: Improvement
>>         Components: bsp
>>   Affects Versions: 0.3.0
>>           Reporter: Edward J. Yoon
>>           Assignee: Edward J. Yoon
>>            Fix For: 0.4.0
>> 
>>        Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>> 
>> 
>> I think, the lock file must include:
>> * the job ID
>> * the task ID of the lock file owner
>> * the current superstep count
>> to check ownership and validation.
>> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 
> 
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
> 
> 

[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065791#comment-13065791 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

You just took the code from ZooKeeper's barrier sync example, right? :D

I would prefer a self-written solution, as stated in the first post.

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104496#comment-13104496 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

I see several exceptions, but everything runs fine (Ubuntu x64 in pseudo-distributed mode with 3 tasks).

For superstep 3

{noformat} 
2011-09-14 15:26:18,573 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000000_0 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /bsp/job_201109141522_0001/3
2011-09-14 15:26:18,573 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000000_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
2011-09-14 15:26:18,573 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000000_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
2011-09-14 15:26:18,574 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000000_0         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
2011-09-14 15:26:18,574 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000000_0         at org.apache.hama.bsp.BSPPeer.enterBarrier(BSPPeer.java:394)
2011-09-14 15:26:18,574 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000000_0         at org.apache.hama.bsp.BSPPeer.sync(BSPPeer.java:309)
2011-09-14 15:26:18,574 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000000_0         at org.apache.hama.examples.PiEstimator$MyEstimator.bsp(PiEstimator.java:80)
2011-09-14 15:26:18,574 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000000_0         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:60)
2011-09-14 15:26:18,574 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000000_0         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:940)
2011-09-14 15:26:18,575 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0 11/09/14 15:26:18 WARN bsp.BSPPeer: Ignore for JobID/superstepcount znode is created.
2011-09-14 15:26:18,575 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /bsp/job_201109141522_0001/3
2011-09-14 15:26:18,575 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
2011-09-14 15:26:18,575 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
2011-09-14 15:26:18,575 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
2011-09-14 15:26:18,575 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0         at org.apache.hama.bsp.BSPPeer.enterBarrier(BSPPeer.java:394)
2011-09-14 15:26:18,576 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0         at org.apache.hama.bsp.BSPPeer.sync(BSPPeer.java:309)
2011-09-14 15:26:18,576 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0         at org.apache.hama.examples.PiEstimator$MyEstimator.bsp(PiEstimator.java:80)
2011-09-14 15:26:18,576 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:60)
2011-09-14 15:26:18,576 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000001_0         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:940)
2011-09-14 15:26:18,578 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0001_000002_0 11/09/14 15:26:18 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():2 children in the list:[attempt_201109141522_0001_000000_0, attempt_20110
{noformat}

For superstep 999

{noformat} 
2011-09-14 15:34:58,655 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000000_0 before enterBarrier() 
2011-09-14 15:34:58,655 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:58 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():2 children in the list:[attempt_201109141522_0002_000002_0, attempt_201109141522_0002_000000_0]
2011-09-14 15:34:58,655 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:58 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():3 children in the list:[attempt_201109141522_0002_000002_0, attempt_201109141522_0002_000000_0, attempt_201109141522_0002_000001_0]
2011-09-14 15:34:58,656 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000000_0 after enterBarrier() 
2011-09-14 15:34:58,656 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:58 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():3 children in the list:[attempt_201109141522_0002_000002_0, attempt_201109141522_0002_000000_0, attempt_201109141522_0002_000001_0]
2011-09-14 15:34:58,656 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000001_0 after enterBarrier() 
2011-09-14 15:34:58,656 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000002_0 11/09/14 15:34:58 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():3 children in the list:[attempt_201109141522_0002_000002_0, attempt_201109141522_0002_000000_0, attempt_201109141522_0002_000001_0]
2011-09-14 15:34:58,656 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000002_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000002_0 after enterBarrier() 
2011-09-14 15:34:58,857 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000000_0 before leaveBarrier() 
2011-09-14 15:34:58,858 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000001_0 before leaveBarrier() 
2011-09-14 15:34:58,859 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:58 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:1 children in the list[attempt_201109141522_0002_000002_0]
2011-09-14 15:34:58,859 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:58 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:1 children in the list[attempt_201109141522_0002_000002_0]
2011-09-14 15:34:58,894 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000002_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000002_0 before leaveBarrier() 
2011-09-14 15:34:58,895 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000002_0 11/09/14 15:34:58 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:0 children in the list[]
2011-09-14 15:34:58,895 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000002_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000002_0 after leaveBarrier() 
2011-09-14 15:34:58,895 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:58 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:0 children in the list[]
2011-09-14 15:34:58,895 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:58 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:0 children in the list[]
2011-09-14 15:34:58,896 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000000_0 after leaveBarrier() 
2011-09-14 15:34:58,896 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:58 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000001_0 after leaveBarrier() 

{noformat}

and

{noformat}
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:59 INFO bsp.BSPPeer: =====> jobid:job_201109141522_0002 taskid:attempt_201109141522_0002_000000_0 before enterBarrier() 
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 11/09/14 15:34:59 WARN bsp.BSPPeer: Ignore for JobID/superstepcount znode is created.
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /bsp/job_201109141522_0002/999
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
2011-09-14 15:34:59,101 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0         at org.apache.hama.bsp.BSPPeer.enterBarrier(BSPPeer.java:394)
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0         at org.apache.hama.bsp.BSPPeer.sync(BSPPeer.java:309)
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0         at org.apache.hama.examples.RandBench$RandBSP.bsp(RandBench.java:67)
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:60)
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000001_0 11/09/14 15:34:59 INFO examples.RandBench$RandBSP: ubuntu.ubuntu-domain:61001 to ubuntu.ubuntu-domain:61001 : 512
2011-09-14 15:34:59,102 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141522_0002_000000_0         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:940)
{noformat}

Is this helpful for you?
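
The NodeExistsException traces above look consistent with several tasks racing to create the same per-superstep parent znode (for example /bsp/job_201109141522_0001/3): only the first create can succeed, and the others are logged and ignored. A minimal sketch of that create-if-absent pattern follows. It is not the Hama code; a ConcurrentHashMap stands in for ZooKeeper, and the names SuperstepNodeSketch and ensureSuperstepNode are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: an in-memory map stands in for the ZooKeeper namespace.
final class SuperstepNodeSketch {
  // path -> payload (empty here, the parent node carries no data)
  private final ConcurrentMap<String, byte[]> nodes = new ConcurrentHashMap<>();

  /**
   * Creates the per-superstep parent node unless another task already did.
   * Returns true if this caller created it, false if it already existed.
   * The "already exists" case is expected under concurrency and is not an error.
   */
  boolean ensureSuperstepNode(String bspRoot, String jobId, long superstep) {
    String path = bspRoot + "/" + jobId + "/" + superstep;
    return nodes.putIfAbsent(path, new byte[0]) == null;
  }

  boolean exists(String path) {
    return nodes.containsKey(path);
  }

  public static void main(String[] args) {
    SuperstepNodeSketch zk = new SuperstepNodeSketch();
    // Two tasks of the same job reach superstep 3; only the first one creates the node.
    System.out.println(zk.ensureSuperstepNode("/bsp", "job_201109141522_0001", 3));
    System.out.println(zk.ensureSuperstepNode("/bsp", "job_201109141522_0001", 3));
  }
}
```

With the real client, the equivalent would presumably be to catch KeeperException.NodeExistsException around the create call and log it at a low level rather than printing the full stack trace.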

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104619#comment-13104619 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

I tested it with my 5 VMs. It is stable for the bench example (args: 1048576 200 1000). Great work! :D

But I hit a bug that causes the SSSP example to fail:

{noformat}

2011-09-14 18:12:55,986 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141809_0001_000001_0 java.lang.ArrayIndexOutOfBoundsException: 1
2011-09-14 18:12:55,986 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141809_0001_000001_0         at org.apache.hama.bsp.BSPPeer.getAddress(BSPPeer.java:509)
2011-09-14 18:12:55,986 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141809_0001_000001_0         at org.apache.hama.bsp.BSPPeer.send(BSPPeer.java:279)
2011-09-14 18:12:55,986 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141809_0001_000001_0         at org.apache.hama.examples.graph.ShortestPaths.sendMessageToNeighbors(ShortestPaths.java:157)
2011-09-14 18:12:55,986 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141809_0001_000001_0         at org.apache.hama.examples.graph.ShortestPaths.bsp(ShortestPaths.java:66)
2011-09-14 18:12:55,986 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141809_0001_000001_0         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:60)
2011-09-14 18:12:55,986 INFO org.apache.hama.bsp.TaskRunner: attempt_201109141809_0001_000001_0         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:940)

{noformat}

ArrayIndexOutOfBoundsException is some serious business.
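
The trace does not show the cause. One possibility, and it is only an assumption since the getAddress source is not quoted here, is that a peer name without a "host:port" separator (for instance a task attempt ID) is split on ':' and the second element is read unchecked. A defensive parser along those lines might look like this; PeerAddressSketch and parsePeerName are hypothetical names.

```java
import java.net.InetSocketAddress;

// Hypothetical sketch, not the Hama implementation of BSPPeer.getAddress.
final class PeerAddressSketch {

  /**
   * Parses "host:port" into an unresolved socket address.
   * Throws IllegalArgumentException with a readable message instead of
   * failing with ArrayIndexOutOfBoundsException on malformed input.
   */
  static InetSocketAddress parsePeerName(String peerName) {
    int idx = peerName.lastIndexOf(':');
    if (idx <= 0 || idx == peerName.length() - 1) {
      throw new IllegalArgumentException(
          "Expected host:port but got '" + peerName + "'");
    }
    String host = peerName.substring(0, idx);
    // NumberFormatException is a subclass of IllegalArgumentException.
    int port = Integer.parseInt(peerName.substring(idx + 1));
    return InetSocketAddress.createUnresolved(host, port);
  }

  public static void main(String[] args) {
    InetSocketAddress addr = parsePeerName("ubuntu.ubuntu-domain:61001");
    System.out.println(addr.getHostString() + " / " + addr.getPort());
    try {
      parsePeerName("attempt_201109141809_0001_000001_0");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```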

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051112#comment-13051112 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

You forgot the import statement in the patch:
import org.apache.hama.util.Bytes;

But it works fine, at least the tests passed.


> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: HAMA-387_v02.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104464#comment-13104464 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

Strange that this works; I checked it with the unit test and it hung.
I can only offer a 5-VM cluster, too, but I can test the Shortest Paths example.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065766#comment-13065766 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

Attaching more logs.

{code}
root@hnode01:/usr/local/src/hama-trunk# bin/hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar bench 2 2 200
11/07/15 17:00:05 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://hnode15:9000/tmp/hadoop-root/bsp/system/submit_157sln
11/07/15 17:00:06 INFO bsp.BSPJobClient: Running job: job_201107151659_0001
11/07/15 17:00:09 INFO bsp.BSPJobClient: Current supersteps number: 0


^Croot@hnode01:/usr/local/src/hama-trunk# bin/hama job -kill job_201107151659_0001
Killed job job_201107151659_0001
root@hnode01:/usr/local/src/hama-trunk# bin/stop-bspd.sh
stopping bspmaster
hnode2: stopping groom
hnode10: stopping groom
hnode5: stopping groom
hnode3: stopping groom
hnode7: stopping groom
hnode9: stopping groom
hnode1: stopping groom
hnode13: stopping groom
hnode4: stopping groom
hnode6: stopping groom
hnode8: stopping groom
hnode16: stopping groom
hnode15: stopping groom
hnode11: stopping groom
hnode12: stopping groom
hnode1: stopping zookeeper
root@hnode01:/usr/local/src/hama-trunk# date
Fri Jul 15 17:07:04 KST 2011
{code}

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109384#comment-13109384 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

Patch looks good.

I'll test and report tomorrow. :-)

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051131#comment-13051131 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

Hmm, crap.

Can we add a test case? This should be easily reproducible.

What if we prevent peers from entering the barrier while the ZooKeeper lock still exists?
For example, like this:

{noformat}
  protected boolean enterBarrier() throws KeeperException, InterruptedException {
    LOG.debug("[" + getPeerName() + "] enter the enterbarrier");
    try {
      // Wait until the lock znode left over from the previous superstep is gone.
      while (zk.exists(bspRoot + "/" + getPeerName(), false) != null) {
        Thread.sleep(500L);
      }
      // Create this peer's lock znode, storing the current superstep count in it.
      zk.create(bspRoot + "/" + getPeerName(),
          Bytes.toBytes(this.getSuperstepCount()), Ids.OPEN_ACL_UNSAFE,
          CreateMode.EPHEMERAL);
    } catch (KeeperException e) {
      LOG.error("Exception while entering barrier!", e);
    } catch (InterruptedException e) {
      LOG.error("Exception while entering barrier!", e);
      // Restore the interrupt flag so callers can still see the interruption.
      Thread.currentThread().interrupt();
    }
    // etc omitted ...
{noformat}
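
The idea above, waiting until the peer's previous lock znode is gone before creating a new one, can be tried out without a ZooKeeper dependency. The following is a rough sketch under stated assumptions: a ConcurrentHashMap stands in for ZooKeeper, the names EnterBarrierSketch, enterBarrier and leaveBarrier are hypothetical, and the timeout is an addition that the snippet above does not have. It also swaps the separate exists/create calls for a single atomic putIfAbsent.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: an in-memory map stands in for ZooKeeper znodes.
final class EnterBarrierSketch {
  // path -> superstep count stored as the node payload
  private final ConcurrentMap<String, Long> nodes = new ConcurrentHashMap<>();

  /**
   * Polls until no node exists at the given path, then creates it.
   * Returns false if the stale node is still there after the timeout,
   * or if the waiting thread is interrupted.
   */
  boolean enterBarrier(String path, long superstep, long timeoutMillis, long pollMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    // putIfAbsent checks for the node and creates it in one atomic step.
    while (nodes.putIfAbsent(path, superstep) != null) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // the stale node never went away
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // keep the interrupt visible to callers
        return false;
      }
    }
    return true;
  }

  /** Removes the node so the peer can enter the next barrier. */
  void leaveBarrier(String path) {
    nodes.remove(path);
  }

  public static void main(String[] args) {
    EnterBarrierSketch zk = new EnterBarrierSketch();
    String path = "/bsp/peer1"; // made-up path for illustration
    System.out.println(zk.enterBarrier(path, 0L, 100L, 5L)); // node is free
    System.out.println(zk.enterBarrier(path, 1L, 50L, 5L));  // stale node blocks until timeout
    zk.leaveBarrier(path);
    System.out.println(zk.enterBarrier(path, 1L, 100L, 5L)); // free again
  }
}
```

Polling keeps the sketch short. With the real client, a watch on the znode would avoid the fixed sleep interval.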

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: HAMA-387_v02.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Assigned] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon reassigned HAMA-387:
-----------------------------------

    Assignee: ChiaHung Lin  (was: Edward J. Yoon)

I realized that the issue is almost fixed by ChiaHung's x.patch. 

So I've just assigned this issue to ChiaHung.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Updated] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-387:
---------------------------------

    Attachment: HAMA-387_v03.patch

I've coded what I described in my last post.
This works fine for me.

Would you mind testing this?

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, new.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Updated] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon updated HAMA-387:
--------------------------------

    Affects Version/s:     (was: 0.2.0)
                       0.3.0
        Fix Version/s:     (was: 0.3.0)
                       0.4.0

I need more time to test.

So I'm rescheduling this to 0.4.

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, new.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051112#comment-13051112 ] 

Thomas Jungblut edited comment on HAMA-387 at 6/17/11 2:53 PM:
---------------------------------------------------------------

-you forgot the import statement in the patch:-
-import org.apache.hama.util.Bytes;-
No it's actually there ;D

But it works fine, at least the tests passed.


      was (Author: thomas.jungblut):
    -you forgot the import statement in the patch:
import org.apache.hama.util.Bytes;-
No it's actually there ;D

But it works fine, at least the tests passed.

  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: HAMA-387_v02.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113131#comment-13113131 ] 

ChiaHung Lin commented on HAMA-387:
-----------------------------------

Does it hang when executing? This error only indicates that the /ready znode has already been removed (`WARN bsp.BSPPeer: Ignore because znode may be deleted.'). It does not prevent the computation from proceeding. So if it hangs, there may be some other exceptions.
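
The behaviour described here, where a missing /ready znode is harmless because another task already removed it, amounts to an idempotent delete. A minimal sketch, assuming an in-memory set in place of ZooKeeper and using the hypothetical names ReadyNodeSketch and deleteIfPresent:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a concurrent set of paths stands in for ZooKeeper znodes.
final class ReadyNodeSketch {
  private final Set<String> nodes = ConcurrentHashMap.newKeySet();

  void create(String path) {
    nodes.add(path);
  }

  /**
   * Deletes the node if it is still there.
   * Returns true if this caller removed it, false if it was already gone.
   * The false case is the "warn and continue" situation, not a failure.
   */
  boolean deleteIfPresent(String path) {
    return nodes.remove(path);
  }

  public static void main(String[] args) {
    ReadyNodeSketch zk = new ReadyNodeSketch();
    String ready = "/bsp/job_1/3/ready"; // made-up path for illustration
    zk.create(ready);
    System.out.println(zk.deleteIfPresent(ready)); // first task removes the node
    System.out.println(zk.deleteIfPresent(ready)); // second task finds it gone and moves on
  }
}
```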

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113844#comment-13113844 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

{code}
So if it hangs, there may be some other exception.
{code}

Yes, it was another problem.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon updated HAMA-387:
--------------------------------

    Summary: Advanced Barrier Synchronization  (was: Add task ID and superstep count informations to lock file)

I'm renaming this issue title to "Advanced Barrier Synchronization".

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ChiaHung Lin updated HAMA-387:
------------------------------

    Attachment:     (was: conditional_wait.patch)

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, conditional_wait.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-387:
---------------------------------

    Attachment: ownSyncService.patch

Integrated my barrier sync proposal.

The unit test passes. Can we try this at a larger scale?

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, ownSyncService.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112275#comment-13112275 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

It works well. Let's commit this!

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107582#comment-13107582 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

{code}
2011-09-19 10:15:40,004 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000037_0 11/09/19 10:15:40 INFO bsp.BSPPeer: >>> 41
2011-09-19 10:15:40,004 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000037_0 11/09/19 10:15:40 INFO bsp.BSPPeer: =====> jobid:job_201109190955_0007 taskid:attempt_201109190955_0007_000037_0 after enterBarrier()
2011-09-19 10:15:40,213 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000036_0 11/09/19 10:15:40 INFO bsp.BSPPeer: =====> jobid:job_201109190955_0007 taskid:attempt_201109190955_0007_000036_0 before leaveBarrier()
2011-09-19 10:15:40,214 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000037_0 11/09/19 10:15:40 INFO bsp.BSPPeer: =====> jobid:job_201109190955_0007 taskid:attempt_201109190955_0007_000037_0 before leaveBarrier()
2011-09-19 10:15:40,267 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000036_0 11/09/19 10:15:40 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:5 children in the list[attempt_201109190955_0007_000006_0, attempt_201109190955_0007_000038_0, attempt_201109190955_0007_000014_0, attempt_201109190955_0007_000002_0, attempt_201109190955_0007_000003_0]
2011-09-19 10:15:40,267 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000037_0 11/09/19 10:15:40 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:5 children in the list[attempt_201109190955_0007_000006_0, attempt_201109190955_0007_000038_0, attempt_201109190955_0007_000014_0, attempt_201109190955_0007_000002_0, attempt_201109190955_0007_000003_0]
2011-09-19 10:15:40,361 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000036_0 11/09/19 10:15:40 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:4 children in the list[attempt_201109190955_0007_000038_0, attempt_201109190955_0007_000014_0, attempt_201109190955_0007_000002_0, attempt_201109190955_0007_000003_0]
2011-09-19 10:15:40,361 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000037_0 11/09/19 10:15:40 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:4 children in the list[attempt_201109190955_0007_000038_0, attempt_201109190955_0007_000014_0, attempt_201109190955_0007_000002_0, attempt_201109190955_0007_000003_0]
2011-09-19 10:15:40,377 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000037_0 11/09/19 10:15:40 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:3 children in the list[attempt_201109190955_0007_000038_0, attempt_201109190955_0007_000014_0, attempt_201109190955_0007_000003_0]
2011-09-19 10:15:40,378 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000036_0 11/09/19 10:15:40 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:3 children in the list[attempt_201109190955_0007_000038_0, attempt_201109190955_0007_000014_0, attempt_201109190955_0007_000003_0]
2011-09-19 10:15:40,410 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000037_0 11/09/19 10:15:40 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:1 children in the list[attempt_201109190955_0007_000038_0]
2011-09-19 10:15:40,411 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000036_0 11/09/19 10:15:40 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:1 children in the list[attempt_201109190955_0007_000038_0]
2011-09-19 10:15:41,355 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000038_0 11/09/19 10:15:41 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():1 children in the list:[attempt_201109190955_0007_000038_0]
2011-09-19 10:15:41,355 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190955_0007_000038_0 11/09/19 10:15:41 INFO bsp.BSPPeer: >>> 41
{code}

The superstep count does not seem to have increased.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037788#comment-13037788 ] 

ChiaHung Lin commented on HAMA-387:
-----------------------------------

Does cnode14 eventually enter the 98th superstep? From the log, it seems that cnode14 is about to enter the 98th superstep (but has not yet logged it). My understanding is that barrier synchronization waits for all processes to reach the barrier before proceeding. Therefore, if cnode14 logs `enter the 98 barrier' later on and all nodes then leave the barrier, the result looks OK. 

Also, a quick look at the patch shows that the znode is created as EPHEMERAL instead of EPHEMERAL_SEQUENTIAL; this eliminates the issue where a client process disconnects and then reconnects, which leaves a name appended with a monotonically increasing number.   
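
The naming difference can be illustrated with a small in-memory model. This is only a sketch and not ZooKeeper client code: ZnodeNamingModel, createPlain and createSequential are hypothetical names, and the zero-padded 10-digit suffix is assumed from ZooKeeper's documented format for sequential nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** In-memory model of one parent znode's children; illustration only, not the ZooKeeper API. */
final class ZnodeNamingModel {
  private final TreeSet<String> children = new TreeSet<>();
  private int counter = 0; // per-parent sequence counter, as kept for sequential nodes

  /**
   * Models EPHEMERAL: the caller's name is used as-is, so creating it again adds no new child.
   * (A real ZooKeeper server would report that the node already exists.)
   */
  String createPlain(String name) {
    children.add(name);
    return name;
  }

  /** Models EPHEMERAL_SEQUENTIAL: a zero-padded counter is appended on every create. */
  String createSequential(String name) {
    String actual = name + String.format("%010d", counter++);
    children.add(actual);
    return actual;
  }

  /** Returns a sorted snapshot of the child names. */
  List<String> children() {
    return new ArrayList<>(children);
  }
}
```

In this model a task that registers twice under the plain scheme still appears once, while the sequential scheme leaves two differently named children, which would inflate a barrier count based on the number of children.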


> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Updated] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon updated HAMA-387:
--------------------------------

    Attachment: new.patch

I'm attaching another patch.

This is quite a tricky solution, but if it works on a larger cluster I'll commit it.

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: HAMA-387_v02.patch, new.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059634#comment-13059634 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

Thanks, I'll test today.

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, new.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107647#comment-13107647 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

FYI,

{code}
root@Cnode1:/usr/local/src/hama-trunk# core/bin/start-bspd.sh
hnode1: starting zookeeper, logging to /usr/local/src/hama-trunk/core/bin/../logs/hama-root-zookeeper-Cnode1.out
starting bspmaster, logging to /usr/local/src/hama-trunk/core/bin/../logs/hama-root-bspmaster-Cnode1.out
2011-09-19 15:33:21.206::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
2011-09-19 15:33:21.247::INFO:  jetty-6.1.14
2011-09-19 15:33:21.435::INFO:  Started SelectChannelConnector@hnode1:40013
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /bsp/job_201109191247_0001
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:116)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
        at org.apache.hama.bsp.BSPMaster.clearZKNodes(BSPMaster.java:477)
        at org.apache.hama.bsp.BSPMaster.initZK(BSPMaster.java:469)
        at org.apache.hama.bsp.BSPMaster.startMaster(BSPMaster.java:431)
{code}
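
The trace shows a delete being rejected because the job node still has children. A common way around this is to delete children first (post-order). The sketch below uses an in-memory stand-in rather than the ZooKeeper client, so NodeTree and all of its methods are hypothetical; with the real client the same walk would presumably list children and delete them before the parent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** In-memory stand-in for a znode tree, used only to illustrate children-first deletion. */
final class NodeTree {
  private final TreeSet<String> paths = new TreeSet<>();

  /** Adds a node and any missing ancestors. */
  void create(String path) {
    for (int i = path.indexOf('/', 1); i != -1; i = path.indexOf('/', i + 1)) {
      paths.add(path.substring(0, i));
    }
    paths.add(path);
  }

  boolean exists(String path) {
    return paths.contains(path);
  }

  /** Returns the full paths of the direct children of the given node. */
  List<String> children(String path) {
    List<String> result = new ArrayList<>();
    String prefix = path + "/";
    for (String p : paths) {
      if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) == -1) {
        result.add(p);
      }
    }
    return result;
  }

  /** Mirrors the failure in the trace: deleting a node that still has children is rejected. */
  void delete(String path) {
    if (!children(path).isEmpty()) {
      throw new IllegalStateException("Directory not empty for " + path);
    }
    paths.remove(path);
  }

  /** Deletes children first, so every delete() call sees a node without children. */
  void deleteRecursively(String path) {
    for (String child : children(path)) {
      deleteRecursively(child);
    }
    delete(path);
  }
}
```

Calling deleteRecursively on a leftover job node removes its superstep and task children before the job node itself, so the non-empty check never fires.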

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Updated] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon updated HAMA-387:
--------------------------------

    Attachment: sleepless.patch

This patch removes the Thread.sleep() call.
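
The patch itself is not reproduced in this thread, so the following is only an assumption about the general technique: one common way to drop a sleep-based polling loop is to block on a latch that a notification callback releases. LatchWaiter, onReady and awaitReady are hypothetical names, not taken from the patch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Waits for an event without polling: a callback releases a latch instead of a sleep loop. */
final class LatchWaiter {
  private final CountDownLatch ready = new CountDownLatch(1);

  /** Called from the notification thread (for example a watcher callback). */
  void onReady() {
    ready.countDown();
  }

  /** Blocks until onReady() has been called or the timeout elapses; true means released. */
  boolean awaitReady(long timeoutMillis) {
    try {
      return ready.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }
}
```

Compared with a sleep loop, the waiting thread wakes as soon as the callback fires instead of at the next polling interval.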

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113079#comment-13113079 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

{code}

2011-09-23 09:42:34,466 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000046_0 11/09/23 09:42:34 DEBUG bsp.BSPPeer: enterBarrier() znode size within /bsp/job_201109230911_0002/71 is 45. Znodes include [attempt_201109230911_0002_000046_0, attempt_201109230911_0002_000006_0, attempt_201109230911_0002_000005_0, attempt_201109230911_0002_000030_0, attempt_201109230911_0002_000000_0, attempt_201109230911_0002_000026_0, attempt_201109230911_0002_000025_0, attempt_201109230911_0002_000007_0, attempt_201109230911_0002_000024_0, attempt_201109230911_0002_000014_0, attempt_201109230911_0002_000021_0, attempt_201109230911_0002_000045_0, attempt_201109230911_0002_000015_0, attempt_201109230911_0002_000035_0, attempt_201109230911_0002_000020_0, attempt_201109230911_0002_000016_0, attempt_201109230911_0002_000044_0, attempt_201109230911_0002_000009_0, attempt_201109230911_0002_000017_0, attempt_201109230911_0002_000008_0, attempt_201109230911_0002_000011_0, attempt_201109230911_0002_000037_0, attempt_201109230911_0002_000004_0, attempt_201109230911_0002_000043_0, attempt_201109230911_0002_000022_0, attempt_201109230911_0002_000012_0, attempt_201109230911_0002_000019_0, attempt_201109230911_0002_000039_0, attempt_201109230911_0002_000034_0, attempt_201109230911_0002_000036_0, attempt_201109230911_0002_000027_0, attempt_201109230911_0002_000018_0, attempt_201109230911_0002_000033_0, attempt_201109230911_0002_000023_0, attempt_201109230911_0002_000029_0, attempt_201109230911_0002_000013_0, attempt_201109230911_0002_000003_0, attempt_201109230911_0002_000031_0, attempt_201109230911_0002_000028_0, attempt_201109230911_0002_000040_0, attempt_201109230911_0002_000001_0, attempt_201109230911_0002_000042_0, attempt_201109230911_0002_000047_0, attempt_201109230911_0002_000002_0, attempt_201109230911_0002_000032_0]
2011-09-23 09:42:34,482 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000047_0 11/09/23 09:42:34 DEBUG bsp.BSPPeer: enterBarrier() znode size within /bsp/job_201109230911_0002/71 is 47. Znodes include [attempt_201109230911_0002_000010_0, attempt_201109230911_0002_000046_0, attempt_201109230911_0002_000006_0, attempt_201109230911_0002_000005_0, attempt_201109230911_0002_000030_0, attempt_201109230911_0002_000000_0, attempt_201109230911_0002_000026_0, attempt_201109230911_0002_000025_0, attempt_201109230911_0002_000007_0, attempt_201109230911_0002_000024_0, attempt_201109230911_0002_000014_0, attempt_201109230911_0002_000021_0, attempt_201109230911_0002_000045_0, attempt_201109230911_0002_000015_0, attempt_201109230911_0002_000035_0, attempt_201109230911_0002_000020_0, attempt_201109230911_0002_000016_0, attempt_201109230911_0002_000044_0, attempt_201109230911_0002_000009_0, attempt_201109230911_0002_000017_0, attempt_201109230911_0002_000008_0, attempt_201109230911_0002_000011_0, attempt_201109230911_0002_000037_0, attempt_201109230911_0002_000004_0, attempt_201109230911_0002_000043_0, attempt_201109230911_0002_000022_0, attempt_201109230911_0002_000012_0, attempt_201109230911_0002_000019_0, attempt_201109230911_0002_000039_0, attempt_201109230911_0002_000034_0, attempt_201109230911_0002_000036_0, attempt_201109230911_0002_000027_0, attempt_201109230911_0002_000018_0, attempt_201109230911_0002_000033_0, attempt_201109230911_0002_000023_0, attempt_201109230911_0002_000029_0, attempt_201109230911_0002_000013_0, attempt_201109230911_0002_000003_0, attempt_201109230911_0002_000031_0, attempt_201109230911_0002_000028_0, attempt_201109230911_0002_000040_0, attempt_201109230911_0002_000001_0, attempt_201109230911_0002_000041_0, attempt_201109230911_0002_000042_0, attempt_201109230911_0002_000047_0, attempt_201109230911_0002_000002_0, attempt_201109230911_0002_000032_0]
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000046_0 11/09/23 09:42:34 WARN bsp.BSPPeer: Ignore because znode may be deleted.
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000046_0 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/job_201109230911_0002/71/ready
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000045_0 11/09/23 09:42:34 WARN bsp.BSPPeer: Ignore because znode may be deleted.
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000045_0 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/job_201109230911_0002/71/ready
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000045_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000045_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000045_0         at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000045_0         at org.apache.hama.bsp.BSPPeer$1.process(BSPPeer.java:397)
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000045_0         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
2011-09-23 09:42:34,507 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000046_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
2011-09-23 09:42:34,508 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000046_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
2011-09-23 09:42:34,508 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000046_0         at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
2011-09-23 09:42:34,508 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000046_0         at org.apache.hama.bsp.BSPPeer$1.process(BSPPeer.java:397)
2011-09-23 09:42:34,508 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000046_0         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
2011-09-23 09:42:34,516 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000047_0 11/09/23 09:42:34 WARN bsp.BSPPeer: Ignore because znode may be deleted.
2011-09-23 09:42:34,516 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000047_0 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/job_201109230911_0002/71/ready
2011-09-23 09:42:34,516 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000047_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
2011-09-23 09:42:34,516 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000047_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
2011-09-23 09:42:34,516 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000047_0         at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
2011-09-23 09:42:34,516 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000047_0         at org.apache.hama.bsp.BSPPeer$1.process(BSPPeer.java:397)
2011-09-23 09:42:34,516 INFO org.apache.hama.bsp.TaskRunner: attempt_201109230911_0002_000047_0         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
{code}

The problem occurred on my testbed, too.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


       

[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037787#comment-13037787 ] 

Thomas Jungblut edited comment on HAMA-387 at 5/23/11 8:30 AM:
---------------------------------------------------------------

Ah I see :) 
Why don't we pick a bspRoot that represents the jobID + superstep?
The layout could then be:

bspRoot + "/" + jobID + "/" + superstep + "/" + groom
EDIT: the task ID is not needed in the path...

Or does this go against how ZooKeeper is meant to be used?
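
The proposed layout can be written down as a small path helper. This is a sketch only; BarrierPaths and its method names are hypothetical and do not come from any attached patch.

```java
/** Builds barrier znode paths for the proposed layout; names here are illustrative. */
final class BarrierPaths {
  private BarrierPaths() {}

  /** Returns bspRoot/jobID/superstep, the per-superstep parent node. */
  static String superstepPath(String bspRoot, String jobId, long superstep) {
    return bspRoot + "/" + jobId + "/" + superstep;
  }

  /** Returns bspRoot/jobID/superstep/groom, one child per participant. */
  static String memberPath(String bspRoot, String jobId, long superstep, String groom) {
    return superstepPath(bspRoot, jobId, superstep) + "/" + groom;
  }
}
```

With one parent node per superstep, a barrier check only has to count the children of that parent, and nodes from earlier supersteps cannot be mistaken for current ones.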

      was (Author: thomas.jungblut):
    Ah I see :) 
Why don't we pick a bspRoot that represents the jobID + superstep.
The layout could be then:

bspRoot + "/" + jobID + "/" + superstep
EDIT taskid is not needed in the path...

Or is this violating the zookeeper?
  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032967#comment-13032967 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

Just a short question.
Why do we stick with lock files instead of using RPC calls and handling the locks on the master?

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051112#comment-13051112 ] 

Thomas Jungblut edited comment on HAMA-387 at 6/17/11 2:53 PM:
---------------------------------------------------------------

-you forgot the import statement in the patch:
import org.apache.hama.util.Bytes;-
No it's actually there ;D

But it works fine, at least the tests passed.


      was (Author: thomas.jungblut):
    you forgot the import statement in the patch:
import org.apache.hama.util.Bytes;

But it works fine, at least the tests passed.

  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: HAMA-387_v02.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-387:
---------------------------------

    Attachment: ownSyncService_v2.patch

fixed checkpointer testcase

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037741#comment-13037741 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

Hmm, it's still problematic.

{code}
cnode11:
2011-05-23 14:25:21,731 DEBUG org.apache.hama.bsp.BSPPeer: Send bytes ([B@28f19d6e) to cnode14.ucloud:61000
2011-05-23 14:25:21,732 DEBUG org.apache.hama.bsp.BSPPeer: [cnode11.cloud:61000] enter the 98 barrier

cnode12:
2011-05-23 14:34:11,527 DEBUG org.apache.hama.bsp.BSPPeer: Send bytes ([B@53ea0105) to cnode2.ucloud:61000
2011-05-23 14:34:11,528 DEBUG org.apache.hama.bsp.BSPPeer: [cnode12.cloud:61000] enter the 98 barrier

cnode13:
2011-05-23 14:27:39,965 DEBUG org.apache.hama.bsp.BSPPeer: Local send bytes ([B@528a52b6)
2011-05-23 14:27:39,966 DEBUG org.apache.hama.bsp.BSPPeer: [cnode13.cloud:61000] enter the 98 barrier

cnode14:
2011-05-23 14:19:24,116 DEBUG org.apache.hama.bsp.BSPPeer: [cnode14.cloud:61000] enter the 97 barrier
2011-05-23 14:19:24,136 DEBUG org.apache.hama.bsp.BSPPeer: [cnode14.cloud:61000] leave the 97 barrier
{code}

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037806#comment-13037806 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

>> Does cnode14 eventually enter the 98th superstep?

Nope.
Hmm, it's very hard to explain.

{code}
    zk.delete(bspRoot + "/" + getPeerName(), 0); // Suppose this peer deletes the last znode.

    // The other peers then see (list.size() == 0), return from leaveBarrier(),
    // and start calling enterBarrier(), which creates new znodes under bspRoot.

    while (true) {  // This peer now finds those new znodes and hangs forever.
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
        if (list.size() > 0) {
          mutex.wait();
        } else {
          LOG.debug("[" + getPeerName() + "] leave from the leaveBarrier");
          return true;
        }
      }
    }
{code}
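
The interleaving described in the comments above can be replayed without ZooKeeper. The following is only a sketch: an in-memory set stands in for the children of bspRoot, and the class name, method names, and peer names (FlatBarrierRace, peerA, peerB) are hypothetical and do not come from any patch attached to this issue.

```java
import java.util.Set;
import java.util.TreeSet;

/** In-memory stand-in for the children of bspRoot; hypothetical, no ZooKeeper involved. */
public final class FlatBarrierRace {
  private final Set<String> children = new TreeSet<>();

  /** Models enterBarrier(): the peer creates its znode. */
  public void enter(String peer) { children.add(peer); }

  /** Models the zk.delete(...) call at the start of leaveBarrier(). */
  public void delete(String peer) { children.remove(peer); }

  /** Models the check (list.size() > 0) that leads to mutex.wait(). */
  public boolean wouldWait() { return !children.isEmpty(); }

  /** Replays the interleaving; returns true if the last peer ends up waiting. */
  public static boolean replay() {
    FlatBarrierRace zk = new FlatBarrierRace();
    zk.enter("peerA");
    zk.enter("peerB");
    zk.delete("peerA");               // peerA leaves first; peerB's znode still exists
    zk.delete("peerB");               // peerB deletes the last znode
    boolean aWaits = zk.wouldWait();  // false: peerA sees an empty list and returns
    zk.enter("peerA");                // peerA already enters the next barrier
    boolean bWaits = zk.wouldWait();  // true: peerB now sees peerA's new znode
    return !aWaits && bWaits;
  }
}
```

In this replay peerB is left waiting on a znode that belongs to the next superstep, while peerA waits in the next barrier for peerB to arrive.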

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ChiaHung Lin updated HAMA-387:
------------------------------

    Attachment: x.patch

Could anyone help test the patch and paste the log so I can check its internal state? I mainly want to know which processes remain within the superstep when sync() hangs. I've tested this patch with 5 VMs and it works OK, but I understand that is probably because not enough processes were involved.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-387:
---------------------------------

    Attachment: ownSyncService_v3.patch

After copying the XML configurations from HDFS to local and from local to HDFS, it works in pseudo-distributed mode.
I hope it will work in distributed mode.

The test cases are working, and so is the bench in pseudo-distributed mode.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051119#comment-13051119 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

Thanks for your review. BTW, it still seems problematic.


> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: HAMA-387_v02.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103352#comment-13103352 ] 

ChiaHung Lin commented on HAMA-387:
-----------------------------------

Sorry, I missed patch v4. Any chance of changing the zk path (e.g. in the enterBarrier function) to bspRoot+"/"+superstep+"/"+taskId, instead of checking a pool with all tasks mixed together?
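
A per-superstep layout like the one suggested above could be composed as in the following sketch. The helper class and method names (BarrierPaths, superstepPath, taskPath) are hypothetical and not taken from any attached patch; only the path shape bspRoot + "/" + superstep + "/" + taskId comes from the comment.

```java
/** Hypothetical helper for composing per-superstep barrier znode paths. */
public final class BarrierPaths {
  private BarrierPaths() {}

  /** Parent znode whose children are the tasks inside the barrier of one superstep. */
  public static String superstepPath(String bspRoot, long superstep) {
    return bspRoot + "/" + superstep;
  }

  /** Znode that a single task creates when it enters the barrier of that superstep. */
  public static String taskPath(String bspRoot, long superstep, String taskId) {
    return superstepPath(bspRoot, superstep) + "/" + taskId;
  }
}
```

With such a layout a peer would list only the children of superstepPath(...) for its own superstep. Note that ZooKeeper does not create missing parents, so the superstep znode has to exist before any task znode is created beneath it.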

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ChiaHung Lin updated HAMA-387:
------------------------------

    Attachment: conditional_wait.patch

The new patch may solve the following issues, including what is perhaps the root cause: a task that attaches a watcher to monitor /ready may never be notified because of an unconditional wait. Can anyone help test whether sync() still hangs with this patch?

{code}
2011-09-24 15:44:33,644 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0 11/09/24 15:44:33 ERROR bsp.BSPTask: Exception during BSP execution!
2011-09-24 15:44:33,644 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/job_201109241540_0001/4/attempt_201109241540_0001_000005_0
2011-09-24 15:44:33,644 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
2011-09-24 15:44:33,644 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
2011-09-24 15:44:33,644 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0         at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
2011-09-24 15:44:33,645 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0         at org.apache.hama.bsp.BSPPeer.leaveBarrier(BSPPeer.java:437)
2011-09-24 15:44:33,645 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0         at org.apache.hama.bsp.BSPPeer.sync(BSPPeer.java:335)
2011-09-24 15:44:33,645 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0         at org.apache.hama.examples.PiEstimator$MyEstimator.bsp(PiEstimator.java:80)
2011-09-24 15:44:33,645 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:60)
2011-09-24 15:44:33,645 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:940)
2011-09-24 15:44:33,657 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0 11/09/24 15:44:33 INFO zookeeper.ZooKeeper: Session: 0x3329a6008840001 closed
2011-09-24 15:44:33,657 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0 11/09/24 15:44:33 INFO ipc.Server: Stopping server on 61002
2011-09-24 15:44:33,657 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0 log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
2011-09-24 15:44:33,657 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0 log4j:WARN Please initialize the log4j system properly.
2011-09-24 15:44:33,657 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241540_0001_000005_0 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{code}

{code}
2011-09-24 16:29:09,521 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 11/09/24 16:29:09 WARN bsp.BSPPeer: Ignore because znode may be already created at /bsp/job_201109241626_0001/0
2011-09-24 16:29:09,522 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /bsp/job_201109241626_0001/0
2011-09-24 16:29:09,522 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
2011-09-24 16:29:09,522 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
2011-09-24 16:29:09,522 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
2011-09-24 16:29:09,522 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.BSPPeer.createZnode(BSPPeer.java:367)
2011-09-24 16:29:09,524 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.BSPPeer.createZnode(BSPPeer.java:354)
2011-09-24 16:29:09,524 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.BSPPeer.enterBarrier(BSPPeer.java:388)
2011-09-24 16:29:09,524 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.BSPPeer.sync(BSPPeer.java:308)
2011-09-24 16:29:09,524 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.examples.PiEstimator$MyEstimator.bsp(PiEstimator.java:66)
2011-09-24 16:29:09,524 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.BSPTask.run(BSPTask.java:60)
2011-09-24 16:29:09,524 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:940)
2011-09-24 16:29:09,673 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 11/09/24 16:29:09 ERROR bsp.BSPTask: Exception during BSP execution!
2011-09-24 16:29:09,673 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/job_201109241626_0001/0
2011-09-24 16:29:09,673 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
2011-09-24 16:29:09,674 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
2011-09-24 16:29:09,674 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1243)
2011-09-24 16:29:09,674 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1271)
2011-09-24 16:29:09,674 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.BSPPeer.enterBarrier(BSPPeer.java:411)
2011-09-24 16:29:09,674 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.BSPPeer.sync(BSPPeer.java:308)
2011-09-24 16:29:09,674 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.examples.PiEstimator$MyEstimator.bsp(PiEstimator.java:66)
2011-09-24 16:29:09,674 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.BSPTask.run(BSPTask.java:60)
2011-09-24 16:29:09,674 INFO org.apache.hama.bsp.TaskRunner: attempt_201109241626_0001_000011_0 	at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:940)
{code}

Test result output: 
{code}
$ hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar pi
11/09/24 22:49:24 INFO bsp.BSPJobClient: Running job: job_201109242248_0001
11/09/24 22:49:27 INFO bsp.BSPJobClient: Current supersteps number: 0
...
11/09/24 22:57:02 INFO bsp.BSPJobClient: Current supersteps number: 101
11/09/24 22:57:10 INFO bsp.BSPJobClient: The total number of supersteps: 101
Estimated value of PI is 3.1428666666666665
Job Finished in 472.434 seconds
$ hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar pi
11/09/24 22:57:20 INFO bsp.BSPJobClient: Running job: job_201109242248_0002
11/09/24 22:57:23 INFO bsp.BSPJobClient: Current supersteps number: 0
...
11/09/24 23:03:12 INFO bsp.BSPJobClient: Current supersteps number: 101
11/09/24 23:03:25 INFO bsp.BSPJobClient: The total number of supersteps: 101
Estimated value of PI is 3.1447999999999996
Job Finished in 368.786 seconds
$ hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar pi
11/09/24 23:04:27 INFO bsp.BSPJobClient: Running job: job_201109242248_0003
11/09/24 23:04:30 INFO bsp.BSPJobClient: Current supersteps number: 0
...
11/09/24 23:10:50 INFO bsp.BSPJobClient: Current supersteps number: 101
11/09/24 23:11:02 INFO bsp.BSPJobClient: The total number of supersteps: 101
Estimated value of PI is 3.144633333333333
Job Finished in 398.859 seconds
$ hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar pi
11/09/24 23:15:26 INFO bsp.BSPJobClient: Running job: job_201109242248_0004
11/09/24 23:15:29 INFO bsp.BSPJobClient: Current supersteps number: 0
...
11/09/24 23:20:40 INFO bsp.BSPJobClient: Current supersteps number: 101
11/09/24 23:20:50 INFO bsp.BSPJobClient: The total number of supersteps: 101
Estimated value of PI is 3.1455999999999995
Job Finished in 331.478 seconds
$ hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar pi
11/09/24 23:21:03 INFO bsp.BSPJobClient: Running job: job_201109242248_0005
11/09/24 23:21:06 INFO bsp.BSPJobClient: Current supersteps number: 0
...
11/09/24 23:26:41 INFO bsp.BSPJobClient: Current supersteps number: 101
11/09/24 23:26:48 INFO bsp.BSPJobClient: The total number of supersteps: 101
Estimated value of PI is 3.1420333333333335
Job Finished in 350.252 seconds
$ hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar pi
11/09/24 23:27:02 INFO bsp.BSPJobClient: Running job: job_201109242248_0006
11/09/24 23:27:05 INFO bsp.BSPJobClient: Current supersteps number: 0
...
11/09/24 23:32:36 INFO bsp.BSPJobClient: Current supersteps number: 101
11/09/24 23:32:48 INFO bsp.BSPJobClient: The total number of supersteps: 101
Estimated value of PI is 3.1422000000000003
Job Finished in 352.089 seconds
$ hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar pi
11/09/24 23:35:35 INFO bsp.BSPJobClient: Running job: job_201109242248_0007
11/09/24 23:35:38 INFO bsp.BSPJobClient: Current supersteps number: 0
...
11/09/24 23:41:00 INFO bsp.BSPJobClient: Current supersteps number: 101
11/09/24 23:41:07 INFO bsp.BSPJobClient: The total number of supersteps: 101
Estimated value of PI is 3.139433333333333
Job Finished in 336.643 seconds
{code}
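
The unconditional wait mentioned at the top of this comment refers to a general Java pitfall: if the notification arrives before wait() is called, the waiter sleeps forever. The sketch below shows the usual guarded-wait pattern under that assumption. It is not the code in conditional_wait.patch, and the class and method names (ReadyLatch, markReady, awaitReady) are hypothetical.

```java
/** Hypothetical sketch of a guarded (conditional) wait. */
public final class ReadyLatch {
  private final Object mutex = new Object();
  private boolean ready = false; // condition protected by mutex

  /** Would be called from the watcher callback once /ready is observed. */
  public void markReady() {
    synchronized (mutex) {
      ready = true;
      mutex.notifyAll();
    }
  }

  /**
   * Blocks until markReady() has been called, even if it was called earlier.
   * Returns false if the thread was interrupted while waiting.
   */
  public boolean awaitReady() {
    synchronized (mutex) {
      while (!ready) { // re-check the condition instead of waiting unconditionally
        try {
          mutex.wait();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
      return true;
    }
  }
}
```

Because the flag is checked under the same lock that guards the notification, a markReady() that happens before awaitReady() is not lost.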


> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, conditional_wait.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037787#comment-13037787 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

Ah I see :) 
Why don't we pick a bspRoot that represents the jobID + superstep?
The layout could then be:

bspRoot + "/" + jobID + "_" + taskID + "/" + superstep

Or does this violate ZooKeeper's rules?

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038473#comment-13038473 ] 

Thomas Jungblut edited comment on HAMA-387 at 5/24/11 10:02 AM:
----------------------------------------------------------------

Won't work. 

{noformat}
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/1_thomas-desktop:56492
{noformat}

ZooKeeper sucks?:D

EDIT:
We actually have to store the superstep count in the byte value of this lock. Then we have to fetch the object and deserialize it to check which superstep the node is in...

This is really crappy; we should open a feature ticket on the ZK project.
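
As a rough illustration of the EDIT above, the superstep count could be turned into the lock's byte value and read back as follows. This is a sketch only: the class and method names (SuperstepData, encode, decode) are hypothetical and the fixed 8-byte encoding is an assumption, not something taken from the patches.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper, not from the HAMA-387 patches. */
public final class SuperstepData {
  private SuperstepData() {}

  /** Serializes the superstep count into the bytes stored in the lock znode. */
  public static byte[] encode(long superstep) {
    return ByteBuffer.allocate(Long.BYTES).putLong(superstep).array();
  }

  /** Deserializes the superstep count read back from the lock znode. */
  public static long decode(byte[] data) {
    if (data == null || data.length != Long.BYTES) {
      throw new IllegalArgumentException("expected " + Long.BYTES + " bytes");
    }
    return ByteBuffer.wrap(data).getLong();
  }
}
```

The encoded array is the kind of payload that would be passed as the data argument when creating the znode, and decode would be applied to the bytes fetched from it.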

      was (Author: thomas.jungblut):
    Won't work. 

{noformat}
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/1_thomas-desktop:56492
{noformat}

ZooKeeper sucks?:D
  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103551#comment-13103551 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

{quote}
Sorry I miss the patch v4. Any chance to change zk path (e.g. in enterBarrier function) to bspRoot+"/"+superstep+"/"+taskId instead of checking the pool with all task mixed up together
{quote}

Same problem as it was before. At least I don't know what to do anymore.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107570#comment-13107570 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

Job hangs again in the patch test.

{code}
root@Cnode1:/usr/local/src/hama-trunk# core/bin/hama jar examples/target/hama-exampleSNAPSHOT.jar bench 160 10000 64
11/09/19 09:34:31 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://hnode15:9/bsp/system/submit_z5c7vt
11/09/19 09:34:31 INFO bsp.BSPJobClient: Running job: job_201109190912_0005
11/09/19 09:34:34 INFO bsp.BSPJobClient: Current supersteps number: 0
11/09/19 09:34:40 INFO bsp.BSPJobClient: Current supersteps number: 1
11/09/19 09:34:43 INFO bsp.BSPJobClient: Current supersteps number: 3
11/09/19 09:34:46 INFO bsp.BSPJobClient: Current supersteps number: 5
11/09/19 09:34:49 INFO bsp.BSPJobClient: Current supersteps number: 6
11/09/19 09:34:52 INFO bsp.BSPJobClient: Current supersteps number: 8
11/09/19 09:34:55 INFO bsp.BSPJobClient: Current supersteps number: 10
11/09/19 09:34:58 INFO bsp.BSPJobClient: Current supersteps number: 12
11/09/19 09:35:01 INFO bsp.BSPJobClient: Current supersteps number: 13
11/09/19 09:35:04 INFO bsp.BSPJobClient: Current supersteps number: 14

----

2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():45 children in the list:[attempt_201109190912_0005_000020_0, attempt_201109190912_0005_000005_0, attempt_201109190912_0005_000030_0, attempt_201109190912_0005_000021_0, attempt_201109190912_0005_000023_0, attempt_201109190912_0005_000004_0, attempt_201109190912_0005_000010_0, attempt_201109190912_0005_000014_0, attempt_201109190912_0005_000015_0, attempt_201109190912_0005_000039_0, attempt_201109190912_0005_000006_0, attempt_201109190912_0005_000007_0, attempt_201109190912_0005_000019_0, attempt_201109190912_0005_000044_0, attempt_201109190912_0005_000024_0, attempt_201109190912_0005_000013_0, attempt_201109190912_0005_000025_0, attempt_201109190912_0005_000016_0, attempt_201109190912_0005_000034_0, attempt_201109190912_0005_000042_0, attempt_201109190912_0005_000026_0, attempt_201109190912_0005_000035_0, attempt_201109190912_0005_000008_0, attempt_201109190912_0005_000018_0, attempt_201109190912_0005_000033_0, attempt_201109190912_0005_000009_0, attempt_201109190912_0005_000002_0, attempt_201109190912_0005_000041_0, attempt_201109190912_0005_000036_0, attempt_201109190912_0005_000012_0, attempt_201109190912_0005_000003_0, attempt_201109190912_0005_000011_0, attempt_201109190912_0005_000038_0, attempt_201109190912_0005_000029_0, attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000040_0, attempt_201109190912_0005_000017_0, attempt_201109190912_0005_000043_0, attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000000_0, attempt_201109190912_0005_000001_0, attempt_201109190912_0005_000031_0, attempt_201109190912_0005_000037_0, attempt_201109190912_0005_000022_0, attempt_201109190912_0005_000032_0]
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: =====> jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000005_0 after enterBarrier()
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000003_0 11/09/19 09:35:07 INFO bsp.BSPPeer: =====> jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000003_0 after enterBarrier()
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: =====> jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000005_0 before leaveBarrier()
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:11 children in the list[attempt_201109190912_0005_000007_0, attempt_201109190912_0005_000044_0, attempt_201109190912_0005_000018_0, attempt_201109190912_0005_000009_0, attempt_201109190912_0005_000041_0, attempt_201109190912_0005_000003_0, attempt_201109190912_0005_000011_0, attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000000_0, attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000001_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():11 children in the list:[attempt_201109190912_0005_000007_0, attempt_201109190912_0005_000044_0, attempt_201109190912_0005_000018_0, attempt_201109190912_0005_000009_0, attempt_201109190912_0005_000041_0, attempt_201109190912_0005_000003_0, attempt_201109190912_0005_000011_0, attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000000_0, attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,617 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000003_0 11/09/19 09:35:07 INFO bsp.BSPPeer: =====> jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000003_0 before leaveBarrier()
2011-09-19 09:35:07,661 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000003_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:3 children in the list[attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,661 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000001_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():3 children in the list:[attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,661 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:3 children in the list[attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,836 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000003_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:1 children in the list[attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,836 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000001_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():1 children in the list:[attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,836 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:1 children in the list[attempt_201109190912_0005_000001_0]
{code}

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


       

[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103541#comment-13103541 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

I'll test this, although I have already tried something similar with bspRoot/jobid/superstep.
But let's see.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032995#comment-13032995 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

Oh, the "lock file" is a znode. It differs from a file on local disk because znodes are stored in memory.

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038446#comment-13038446 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

Another idea of mine:

What if we name the "lock files" with the superstep number, like 2_peerName?
Instead of checking the size of the child list, we iterate over the child names and count those whose prefix matches the superstep we are currently in.

Suppose we have:
/bsp/98_cnode14
/bsp/98_cnode13
/bsp/98_cnode12
/bsp/99_cnode11
/bsp/99_cnode10

Peers 11 and 10 have proceeded to superstep 99 for whatever reason, while the others are still in 98.
Currently:
Peers 14, 13 and 12 will never leave superstep 98, because list.size() is always > 0 as long as the znodes of the other peers are not removed.
With this solution:
Peers 14, 13 and 12 can leave superstep 98, because we count only the znodes with the matching prefix instead of all of them.

Is this possible?
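
A minimal sketch of the prefix counting described above, assuming child names have the form superstep_peerName. The class and method names are made up for illustration and are not taken from any patch; the child list is the example from this comment rather than a real zk.getChildren() result.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hama: counts znode names by superstep prefix.
class SuperstepPrefixCount {

  // Returns how many child names of the form "<superstep>_<peerName>"
  // belong to the given superstep. Other names are ignored.
  static int countInSuperstep(List<String> children, long superstep) {
    // The trailing "_" keeps superstep 9 from matching "98_cnode14".
    String prefix = superstep + "_";
    int count = 0;
    for (String child : children) {
      if (child.startsWith(prefix)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    List<String> children = Arrays.asList(
        "98_cnode14", "98_cnode13", "98_cnode12", "99_cnode11", "99_cnode10");
    System.out.println("in 98: " + countInSuperstep(children, 98));
    System.out.println("in 99: " + countInSuperstep(children, 99));
  }
}
```

With this, a peer in superstep 98 would wait only on the count for 98 and would not be blocked by the znodes of peers that already moved on to 99.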

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065768#comment-13065768 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

Hmm, do you have other ideas?

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ChiaHung Lin updated HAMA-387:
------------------------------

    Attachment: HAMA-387.patch

The attached file follows ZooKeeper's double barrier recipe to address the issue that some processes are too fast and delete their znode before the others have been allowed to enter the barrier.
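
To illustrate the idea (this is not the attached patch), here is a simplified in-memory sketch of a double barrier. It assumes one barrier instance per superstep, and the `ready` flag stands in for the "ready" znode of the recipe. All class and method names are hypothetical and no ZooKeeper calls are involved.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical single-use, in-process double barrier (one instance per superstep).
class InMemoryDoubleBarrier {
  private final int size;
  private final Set<String> children = new HashSet<>();
  private boolean ready = false; // stands in for the recipe's "ready" znode

  InMemoryDoubleBarrier(int size) {
    this.size = size;
  }

  // Register and block until all peers have registered.
  synchronized void enter(String name) throws InterruptedException {
    children.add(name);
    if (children.size() == size) {
      ready = true; // last peer to arrive opens the barrier
      notifyAll();
    }
    // Waiting on the flag, not on the child count, means a fast peer that
    // already deregistered in leave() cannot strand a slow peer here.
    while (!ready) {
      wait();
    }
  }

  // Deregister and block until all peers have deregistered.
  synchronized void leave(String name) throws InterruptedException {
    children.remove(name);
    if (children.isEmpty()) {
      notifyAll();
    }
    while (!children.isEmpty()) {
      wait();
    }
  }

  // Runs the given number of peers through enter() and leave() once and
  // returns how many of them completed.
  static int runOnce(int peers) {
    InMemoryDoubleBarrier barrier = new InMemoryDoubleBarrier(peers);
    AtomicInteger finished = new AtomicInteger();
    Thread[] threads = new Thread[peers];
    for (int i = 0; i < peers; i++) {
      String name = "peer" + i;
      threads[i] = new Thread(() -> {
        try {
          barrier.enter(name);
          barrier.leave(name);
          finished.incrementAndGet();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return finished.get();
  }
}
```

If enter() waited on the child count instead of the flag, a peer that removes itself early in leave() could drop the count below the expected size again, which looks like the shrinking child lists in the log of the hanging job.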

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105871#comment-13105871 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

That is great :)
Thank you so much, I'm glad that this is working now.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038473#comment-13038473 ] 

Thomas Jungblut edited comment on HAMA-387 at 5/27/11 3:27 PM:
---------------------------------------------------------------

Won't work. 

{noformat}
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/1_thomas-desktop:56492
{noformat}

ZooKeeper sucks?:D

EDIT:
We actually have to store the superstep count in the data bytes of this lock znode. Then we have to read the data and deserialize it to check which superstep the node is in...

{noformat}
  private int countGroomsInSuperStep(List<String> list, long superStep)
      throws InterruptedException {
    int count = 0;
    for (String groom : list) {
      byte[] data = null;
      try {
        data = zk.getData(bspRoot + "/" + groom, null, null);
      } catch (KeeperException e) {
        LOG.warn("Exception in sync phase of SuperStep " + superStep, e);
      }
      if (data != null && Bytes.toLong(data) == superStep)
        count++;
    }
    return count;
  }

{noformat}

The loop then looks like:
{noformat}
 while (true) {
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
        if (countGroomsInSuperStep(list,this.getSuperstepCount()) > 0) {
          mutex.wait();
        } else {
          LOG.debug("[" + getPeerName() + "] leave from the leaveBarrier");
          return true;
        }
      }
    }
{noformat}

I'm not quite sure if this works out well..

      was (Author: thomas.jungblut):
    Won't work. 

{noformat}
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/1_thomas-desktop:56492
{noformat}

ZooKeeper sucks?:D

EDIT:
We actually have to store the superstep count in the data bytes of this lock znode. Then we have to read the data and deserialize it to check which superstep the node is in...

{noformat}

private int countGroomsInSuperStep(List<String> list, long superStep) throws KeeperException, InterruptedException{
    int count = 0;
    for(String groom : list){
      byte[] data = zk.getData(bspRoot + "/" + groom, null, null);
      if(Bytes.toLong(data) == superStep)
        count++;
    }
    return count;
  }

{noformat}

The loop then looks like:
{noformat}
 while (true) {
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
        if (countGroomsInSuperStep(list,this.getSuperstepCount()) > 0) {
          mutex.wait();
        } else {
          LOG.debug("[" + getPeerName() + "] leave from the leaveBarrier");
          return true;
        }
      }
    }
{noformat}
  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114466#comment-13114466 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

{quote}
I don't see the hang problem. Let's just put our own barrier service on the back burner for now.
{quote}

That's okay. If you have some time, I would be interested in a randbench result.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, conditional_wait.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Updated] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Jungblut updated HAMA-387:
---------------------------------

    Attachment: HAMA-387_v04.patch

Actually we should check for supersteps that are lower than the current one.

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114185#comment-13114185 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

I'll test your patches tomorrow. 
Thanks for all.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, conditional_wait.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ChiaHung Lin updated HAMA-387:
------------------------------

    Attachment: conditional_wait.patch

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, conditional_wait.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103311#comment-13103311 ] 

ChiaHung Lin commented on HAMA-387:
-----------------------------------

I see there are two issues here. 

First, sync() may hang - from the information I received, the problem seemingly comes from the superstep, as we discussed last time. Have we tested this already? Or is there any more detailed information (e.g. steps to reproduce the problem) so that others can help test whether adding the superstep would fix it?

Second, the long-running process - it seems to me this is more of a performance issue (not a showstopper). It can probably be improved by making use of a message tree[1] or by scheduling tasks with roughly equal computation load.

Personally I think the first problem is more important and we should fix it first. 

[1] Practical Barrier Synchronisation. ftp://ftp.comlab.ox.ac.uk/pub/Documents/techpapers/Jonathan.Hill/HillSkill_barrier.ps.Z


> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Issue Comment Edited] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065785#comment-13065785 ] 

Edward J. Yoon edited comment on HAMA-387 at 7/15/11 9:03 AM:
--------------------------------------------------------------

I think we should change the sync mechanism fundamentally.

As we can see in the diagram[1], some tasks can be slower than others.

If a user has a large cluster, "zk.getChildren()" will be called in a loop by a huge number of peers.

{code}
    while (true) {
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
{code}

1. https://issues.apache.org/jira/secure/attachment/12486573/x.PNG

      was (Author: udanax):
    I think we should change the sync mechanism fundamentally.

As we can see in the diagram[1], some tasks can be slower than others.

If a user has a large cluster, "zk.getChildren()" will be called by a huge number of peers.

{code}
    while (true) {
      synchronized (mutex) {
        List<String> list = zk.getChildren(bspRoot, true);
{code}

1. https://issues.apache.org/jira/secure/attachment/12486573/x.PNG
  
> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


        

[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037767#comment-13037767 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

Sleep was a good hack for 75% of all cases. But I'm wondering why this ZooKeeper sync is not working as it actually should.

If nothing works, we should just set up a cyclic barrier on the bspmaster and trigger it with an RPC call...
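
A rough sketch of that fallback, assuming the master holds a java.util.concurrent.CyclicBarrier sized to the number of tasks and that sync() is what an RPC handler would call. Threads stand in for the remote tasks here; the class and method names are hypothetical and nothing below is Hama code.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical master-side barrier; sync() plays the role of the RPC handler.
class MasterBarrierSketch {
  private final AtomicInteger superstep = new AtomicInteger();
  private final CyclicBarrier barrier;

  MasterBarrierSketch(int numTasks) {
    // The barrier action runs once per trip, before any waiting task is released.
    this.barrier = new CyclicBarrier(numTasks, superstep::incrementAndGet);
  }

  // Blocks the calling task until all tasks have called sync() for this
  // superstep, then returns the number of completed supersteps.
  int sync() {
    try {
      barrier.await();
    } catch (InterruptedException | BrokenBarrierException e) {
      throw new IllegalStateException("barrier broken", e);
    }
    return superstep.get();
  }

  // Simulates numTasks tasks that each sync the given number of times and
  // returns the superstep count seen by the master at the end.
  static int run(int numTasks, int supersteps) {
    MasterBarrierSketch master = new MasterBarrierSketch(numTasks);
    Thread[] tasks = new Thread[numTasks];
    for (int i = 0; i < numTasks; i++) {
      tasks[i] = new Thread(() -> {
        for (int s = 0; s < supersteps; s++) {
          master.sync();
        }
      });
      tasks[i].start();
    }
    for (Thread t : tasks) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return master.superstep.get();
  }
}
```

The barrier resets itself after each trip, so one instance can serve every superstep of a job. The cost is that each waiting task holds a blocked call on the master for the duration of the wait.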

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114424#comment-13114424 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

I don't see the hang problem. Let's just put our own barrier service on the back burner for now.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, conditional_wait.patch, doublebarrier.patch, new.patch, ownSyncService.patch, ownSyncService_v2.patch, ownSyncService_v3.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105044#comment-13105044 ] 

ChiaHung Lin commented on HAMA-387:
-----------------------------------

If I am correct, it looks like we originally did not handle KeeperException.NodeExistsException, which means the proposed znode has already been created. Several GroomServers start creating znodes (e.g. JobId/superstep/TaskId) on ZooKeeper, so two (or more) BSPPeers can end up writing the same znode in a check-then-act race. For example, two BSPPeers simultaneously check whether the znode path exists (zk.exists(path)), and both decide to create it (zk.create(path...)) because the returned Stat is null, indicating that no znode exists. One BSPPeer writes faster than the other, so the second BSPPeer fails to create the znode because it already exists. All computation then hangs because `list.size() < jobConf.getNumBspTask()' is always true in the while loop.
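
The check-then-act race described above can be sketched roughly as follows. This is only an illustration, not the code in the attached patches: FakeZk, its local NodeExistsException, ensureZnode() and the znode path are all hypothetical stand-ins, so the example runs without a ZooKeeper server. The idea is to skip the separate exists() check, attempt the create directly, and treat "already exists" as success.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class ZnodeRaceSketch {

  /** Hypothetical stand-in for KeeperException.NodeExistsException. */
  static class NodeExistsException extends Exception {
    NodeExistsException(String path) {
      super("znode already exists: " + path);
    }
  }

  /** Hypothetical in-memory stand-in for the znode tree; not the ZooKeeper client. */
  static class FakeZk {
    private final ConcurrentMap<String, byte[]> nodes = new ConcurrentHashMap<>();

    boolean exists(String path) {
      return nodes.containsKey(path);
    }

    /** Fails if the path is already present, mirroring the behavior described above. */
    void create(String path, byte[] data) throws NodeExistsException {
      if (nodes.putIfAbsent(path, data) != null) {
        throw new NodeExistsException(path);
      }
    }
  }

  /**
   * Ensures the znode exists. Returns true if this caller created it and
   * false if another peer won the race; both outcomes count as success.
   */
  static boolean ensureZnode(FakeZk zk, String path) {
    try {
      zk.create(path, new byte[0]);
      return true;
    } catch (NodeExistsException e) {
      return false; // another peer created it first: not an error
    }
  }

  public static void main(String[] args) throws InterruptedException {
    FakeZk zk = new FakeZk();
    String path = "/bsp/job_1/superstep_0"; // illustrative path only
    AtomicInteger creators = new AtomicInteger();
    CountDownLatch start = new CountDownLatch(1);
    Thread[] peers = new Thread[2];
    for (int i = 0; i < peers.length; i++) {
      peers[i] = new Thread(() -> {
        try {
          start.await(); // release both peers at once to provoke the race
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        if (ensureZnode(zk, path)) {
          creators.incrementAndGet();
        }
      });
      peers[i].start();
    }
    start.countDown();
    for (Thread peer : peers) {
      peer.join();
    }
    System.out.println("exists=" + zk.exists(path) + ", creators=" + creators.get());
  }
}
```

Both peers return normally and exactly one of them reports having created the znode, so neither is left outside the barrier. With the real client, the equivalent would presumably be catching KeeperException.NodeExistsException around zk.create(...).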

As for the ArrayIndexOutOfBoundsException, it seems the peerName parameter passed to BSPPeer.send() is malformed. It should be encoded as host:port (getAddress() splits peerName on `:' into an array).
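
A defensive parse of the host:port form might look like the sketch below. It is not the actual getAddress() implementation; PeerNameSketch and parsePeerName() are hypothetical names, and the point is only that a malformed peerName fails with a descriptive message instead of an ArrayIndexOutOfBoundsException.

```java
import java.net.InetSocketAddress;

public class PeerNameSketch {

  /**
   * Parses a peer name of the form host:port (hypothetical helper).
   * Throws IllegalArgumentException with a readable message when malformed.
   */
  static InetSocketAddress parsePeerName(String peerName) {
    if (peerName == null) {
      throw new IllegalArgumentException("peerName is null");
    }
    int sep = peerName.lastIndexOf(':');
    // Reject a missing separator, an empty host, or an empty port.
    if (sep <= 0 || sep == peerName.length() - 1) {
      throw new IllegalArgumentException("expected host:port but got '" + peerName + "'");
    }
    String host = peerName.substring(0, sep);
    int port;
    try {
      port = Integer.parseInt(peerName.substring(sep + 1));
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("invalid port in '" + peerName + "'", e);
    }
    // createUnresolved avoids a DNS lookup; it also rejects out-of-range ports.
    return InetSocketAddress.createUnresolved(host, port);
  }

  public static void main(String[] args) {
    InetSocketAddress addr = parsePeerName("thomas-desktop:56492");
    System.out.println(addr.getHostString() + " / " + addr.getPort());
    try {
      parsePeerName("thomas-desktop"); // no port: rejected with a clear message
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```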


> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038459#comment-13038459 ] 

Edward J. Yoon commented on HAMA-387:
-------------------------------------

I think it's also a good idea.

I'll test it. Thanks!

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 

[jira] [Updated] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon updated HAMA-387:
--------------------------------

    Attachment: doublebarrier.patch

- Removed Thread.sleep()
- Fixed BSPMaster initZK() bug.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Add task ID and superstep count informations to lock file

Posted by "Thomas Jungblut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038473#comment-13038473 ] 

Thomas Jungblut commented on HAMA-387:
--------------------------------------

Won't work. 

{noformat}
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/1_thomas-desktop:56492
{noformat}

ZooKeeper sucks?:D

> Add task ID and superstep count informations to lock file
> ---------------------------------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>             Fix For: 0.3.0
>
>         Attachments: sleepless.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 

[jira] [Resolved] (HAMA-387) Advanced Barrier Synchronization

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon resolved HAMA-387.
---------------------------------

    Resolution: Fixed

I've just committed this. Thanks ChiaHung!

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: ChiaHung Lin
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387.patch, HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, doublebarrier.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 


[jira] [Commented] (HAMA-387) Advanced Barrier Synchronization

Posted by "ChiaHung Lin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107591#comment-13107591 ] 

ChiaHung Lin commented on HAMA-387:
-----------------------------------

Any chance you could help check why the other tasks do not join the barrier sync (are they still executing, or is some exception other than NodeExistsException being thrown)? The last line of the log output shows that only one task, `attempt_201109190955_0007_000038_0', entered the barrier.

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validation.
> Currently they are named by hostname, but multi-tasks can be run per one groomserver in the future. 
