Posted to dev@hama.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2011/08/25 12:43:48 UTC

Summary of problems with HAMA-413 and Discussion

Today, I tested all the Hama examples on my cluster of 32 nodes, with 96
tasks. The Pi and Serialized Printing examples worked fine, but:

1. Barrier synchronization is not working well (with the 'bench' example).
2. When an unexpected shutdown occurs, the ZK nodes (which are created by
each BSPPeer) are not deleted. There's no way to clean them up before
rebooting the server.
3. The graph examples are not working.
4. There are too many status reports between Groom and Master.
5. And there are many code issues that can be improved.

Issues 1 and 2 are already reported (see HAMA-387, HAMA-407). Work on some
of issues 3, 4, and 5 has already been started by ChiaHung Lin.

Should all of these issues be fixed in HAMA-413, or should we just
commit HAMA-413?

Thanks.
-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Summary of problems with HAMA-413 and Discussion

Posted by "Edward J. Yoon" <ed...@apache.org>.

As you said, the statuses of the groom (and its tasks) must be reported
periodically for many reasons, e.g., fault management and job progress
reporting.

I opened a Jira ticket, HAMA-429, today; let's discuss these issues there.

Please feel free to assign it to yourself if you are willing to design and fix them.

On Tue, Aug 30, 2011 at 12:35 PM, ChiaHung Lin <ch...@nuk.edu.tw> wrote:
> The Jira log shows that the committed patch lets the BSP peer report its status directly back to the master. An issue we may need to consider right now is 'how can we determine whether a groom server has failed?' With the original mechanism, the groom server manages tasks (BSP peers) and the master takes care of groom servers. For instance, if a groom server fails, the master can reschedule all tasks assigned to that groom server to another working one. With the current mechanism, the master, in addition to monitoring the activity of groom servers, also needs to deal with BSP peers. Do we have any plans for this already?



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Summary of problems with HAMA-413 and Discussion

Posted by ChiaHung Lin <ch...@nuk.edu.tw>.

The Jira log shows that the committed patch lets the BSP peer report its status directly back to the master. An issue we may need to consider right now is 'how can we determine whether a groom server has failed?' With the original mechanism, the groom server manages tasks (BSP peers) and the master takes care of groom servers. For instance, if a groom server fails, the master can reschedule all tasks assigned to that groom server to another working one. With the current mechanism, the master, in addition to monitoring the activity of groom servers, also needs to deal with BSP peers. Do we have any plans for this already?
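
As a rough illustration of the original mechanism (the groom reports on behalf of its tasks, and the master watches only grooms), here is a minimal Java sketch. The class GroomFailureDetector, its method names, and the timeout handling are assumptions made up for this example; this is not Hama's actual master code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical master-side bookkeeping; names are illustrative only. */
class GroomFailureDetector {
  private final long timeoutMillis;
  // Last heartbeat time per groom, and the tasks each groom last reported.
  private final Map<String, Long> lastHeartbeat = new HashMap<>();
  private final Map<String, List<String>> tasksByGroom = new HashMap<>();

  GroomFailureDetector(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Record a heartbeat from a groom, together with the tasks it currently runs. */
  synchronized void heartbeat(String groom, List<String> runningTasks, long nowMillis) {
    lastHeartbeat.put(groom, nowMillis);
    tasksByGroom.put(groom, new ArrayList<>(runningTasks));
  }

  /**
   * Drop every groom whose last heartbeat is older than the timeout and
   * return the tasks that were running there, so they can be rescheduled.
   */
  synchronized List<String> expireAndCollectTasks(long nowMillis) {
    List<String> toReschedule = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      if (nowMillis - entry.getValue() > timeoutMillis) {
        List<String> tasks = tasksByGroom.remove(entry.getKey());
        if (tasks != null) {
          toReschedule.addAll(tasks);
        }
        it.remove();
      }
    }
    return toReschedule;
  }

  synchronized boolean isAlive(String groom) {
    return lastHeartbeat.containsKey(groom);
  }
}
```

With this shape, the master only has to track one heartbeat per groom; if BSP peers report directly, the same bookkeeping would be needed per task as well.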

-----Original message-----
From:Edward J. Yoon <ed...@apache.org>
To:hama-dev@incubator.apache.org <ha...@incubator.apache.org>
Date:Fri, 26 Aug 2011 15:11:56 +0900
Subject:Re: Summary of problems with HAMA-413 and Discussion

Okay.

Sent from my iPhone


--
ChiaHung Lin
Department of Information Management
National University of Kaohsiung
Taiwan

Re: Summary of problems with HAMA-413 and Discussion

Posted by "Edward J. Yoon" <ed...@apache.org>.

In the short term, the only problem with the graph examples is that
there's no way to get a list of all BSPPeers when creating a job.

    ...
    // Names of the currently active grooms; they are passed to the partitioner.
    Collection<String> activeGrooms = cluster.getActiveGroomNames().values();
    String[] grooms = activeGrooms.toArray(new String[activeGrooms.size()]);

    LOG.info("Starting data partitioning...");
    if (adjacencyList == null) {
      // No adjacency list given: partition the input at adjacencyListPath.
      conf = (HamaConfiguration) partition(conf, adjacencyListPath, grooms);
    } else {
      // Otherwise partition the given adjacency list.
      conf = (HamaConfiguration) partitionExample(conf, adjacencyList, grooms);
    }
    LOG.info("Finished!");
    ...

Here's my suggestion.

For now, use only one task per groom for the graph examples. Then we
can fix them simply.
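
A minimal sketch of what partitioning by groom could look like under the one-task-per-groom assumption. GroomPartitioner and its hash-based assignment are hypothetical; the real partition(...) and partitionExample(...) methods referenced above are not shown in this thread and may work differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: one partition per groom, assuming one task per groom. */
class GroomPartitioner {
  /** Assign each vertex to exactly one groom by hashing the vertex name. */
  static Map<String, List<String>> partition(List<String> vertices, String[] grooms) {
    if (grooms.length == 0) {
      throw new IllegalArgumentException("no active grooms");
    }
    Map<String, List<String>> parts = new LinkedHashMap<>();
    for (String groom : grooms) {
      parts.put(groom, new ArrayList<>());
    }
    for (String vertex : vertices) {
      // floorMod keeps the index non-negative even for negative hash codes.
      int index = Math.floorMod(vertex.hashCode(), grooms.length);
      parts.get(grooms[index]).add(vertex);
    }
    return parts;
  }
}
```

Because the groom name doubles as the partition key here, this only works while each groom runs a single task; with several tasks per groom, a list of BSPPeers would be needed instead.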

2011/8/28 Thomas Jungblut <th...@googlemail.com>:
>>
>> 1. Barrier Synchronizations are not working well (with a 'bench' example).
>
>
> We should definitely talk about alternatives to ZooKeeper's barrier sync.
> I'm always +1 for homebrew code that always works, instead of framework
> stuff that doesn't.
> Maybe we should ask on the ZK mailing list what they think about
> why it is not working properly. Assuming that we coded it correctly; AFAIK we
> just took the example code, right? :)
>
>
>> 3. Graph examples are not working.
>>
>
> As already mentioned in some other threads, I don't see this working
> without framework support. [1]
> Miklos pointed me to the goldenorb sources [2]; they have an I/O system which
> is built on top of Hadoop's RecordReader and InputFormats. He mentioned that
> he wanted to extend HAMA-409 with this.
> What do you think?
>
> [1] Another option would be the dynamic partition assignment with ZooKeeper
> that Miklos used in HAMA-409.
> [2] https://github.com/raveldata/goldenorb
>
>
>
>
> --
> Thomas Jungblut
> Berlin
>
> mobile: 0170-3081070
>
> business: thomas.jungblut@testberichte.de
> private: thomas.jungblut@gmail.com
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Summary of problems with HAMA-413 and Discussion

Posted by Thomas Jungblut <th...@googlemail.com>.

>
> 1. Barrier Synchronizations are not working well (with a 'bench' example).


We should definitely talk about alternatives to ZooKeeper's barrier sync.
I'm always +1 for homebrew code that always works, instead of framework
stuff that doesn't.
Maybe we should ask on the ZK mailing list what they think about
why it is not working properly. Assuming that we coded it correctly; AFAIK we
just took the example code, right? :)
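
For discussion, here is what barrier semantics look like in a minimal homebrew form: every peer blocks in sync() until all peers have arrived, and only then does the next superstep begin. This sketch is in-process only (threads in one JVM, using java.util.concurrent.CyclicBarrier); the class and method names are hypothetical, and a distributed version would still need ZooKeeper or RPC for coordination.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical in-process barrier; illustrates the semantics, not a distributed sync. */
class LocalSuperstepBarrier {
  private final CyclicBarrier barrier;
  private final AtomicInteger completedSupersteps = new AtomicInteger();

  LocalSuperstepBarrier(int peers) {
    // The barrier action runs once per superstep, after the last peer arrives.
    this.barrier = new CyclicBarrier(peers, completedSupersteps::incrementAndGet);
  }

  /** Block until all peers have called sync() for the current superstep. */
  void sync() throws InterruptedException, BrokenBarrierException {
    barrier.await();
  }

  int completedSupersteps() {
    return completedSupersteps.get();
  }

  /** Run the given number of peers through the given number of supersteps. */
  static int runPeers(int peers, int steps) {
    LocalSuperstepBarrier barrier = new LocalSuperstepBarrier(peers);
    Thread[] threads = new Thread[peers];
    for (int i = 0; i < peers; i++) {
      threads[i] = new Thread(() -> {
        try {
          for (int s = 0; s < steps; s++) {
            barrier.sync(); // local computation would go before this call
          }
        } catch (InterruptedException | BrokenBarrierException e) {
          throw new IllegalStateException(e);
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return barrier.completedSupersteps();
  }
}
```

For example, LocalSuperstepBarrier.runPeers(4, 3) runs four peers through three supersteps and returns the number of completed supersteps.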


> 3. Graph examples are not working.
>

As already mentioned in some other threads, I don't see this working
without framework support. [1]
Miklos pointed me to the goldenorb sources [2]; they have an I/O system which
is built on top of Hadoop's RecordReader and InputFormats. He mentioned that
he wanted to extend HAMA-409 with this.
What do you think?

[1] Another option would be the dynamic partition assignment with ZooKeeper
that Miklos used in HAMA-409.
[2] https://github.com/raveldata/goldenorb

2011/8/26 Edward J. Yoon <ed...@apache.org>

> Okay.
>
> Sent from my iPhone
>
>



-- 
Thomas Jungblut
Berlin

mobile: 0170-3081070

business: thomas.jungblut@testberichte.de
private: thomas.jungblut@gmail.com

Re: Summary of problems with HAMA-413 and Discussion

Posted by "Edward J. Yoon" <ed...@apache.org>.

Okay.

Sent from my iPhone

On 2011. 8. 26., at 2:49 PM, "ChiaHung Lin" <ch...@nuk.edu.tw> wrote:

> The latest patch (HAMA_NEW.patch) for HAMA-413 still seems to use the BSP peer to report its status back to the master.
> 
> +        umbilical.updateTaskStatusAndReport(taskid);
> 
> +  public void updateTaskStatusAndReport(TaskAttemptID taskid) {
> ...
> +    doReport(taskStatus);
> +  }
> 
> Is there any chance of reverting to a version in which the GroomServer reports task status, so that we can base the discussion on that version? This is just to ensure that the following issues are not caused by the code change above.
> 
> -----Original message-----
> From:Edward J. Yoon <ed...@apache.org>
> To:hama-dev@incubator.apache.org
> Date:Thu, 25 Aug 2011 19:43:48 +0900
> Subject:Summary of problems with HAMA-413 and Discussion
> 
> Today, I tested all the Hama examples on my cluster of 32 nodes, with 96
> tasks. The Pi and Serialized Printing examples worked fine, but:
>
> 1. Barrier synchronization is not working well (with the 'bench' example).
> 2. When an unexpected shutdown occurs, the ZK nodes (which are created by
> each BSPPeer) are not deleted. There's no way to clean them up before
> rebooting the server.
> 3. The graph examples are not working.
> 4. There are too many status reports between Groom and Master.
> 5. And there are many code issues that can be improved.
>
> Issues 1 and 2 are already reported (see HAMA-387, HAMA-407). Work on some
> of issues 3, 4, and 5 has already been started by ChiaHung Lin.
>
> Should all of these issues be fixed in HAMA-413, or should we just
> commit HAMA-413?
> 
> Thanks.
> -- 
> Best Regards, Edward J. Yoon
> @eddieyoon
> 
> 
> --
> ChiaHung Lin
> Department of Information Management
> National University of Kaohsiung
> Taiwan

Re: Summary of problems with HAMA-413 and Discussion

Posted by ChiaHung Lin <ch...@nuk.edu.tw>.

The latest patch (HAMA_NEW.patch) for HAMA-413 still seems to use the BSP peer to report its status back to the master.

+        umbilical.updateTaskStatusAndReport(taskid);

+  public void updateTaskStatusAndReport(TaskAttemptID taskid) {
...
+    doReport(taskStatus);
+  }

Is there any chance of reverting to a version in which the GroomServer reports task status, so that we can base the discussion on that version? This is just to ensure that the following issues are not caused by the code change above.

-----Original message-----
From:Edward J. Yoon <ed...@apache.org>
To:hama-dev@incubator.apache.org
Date:Thu, 25 Aug 2011 19:43:48 +0900
Subject:Summary of problems with HAMA-413 and Discussion

Today, I tested all the Hama examples on my cluster of 32 nodes, with 96
tasks. The Pi and Serialized Printing examples worked fine, but:

1. Barrier synchronization is not working well (with the 'bench' example).
2. When an unexpected shutdown occurs, the ZK nodes (which are created by
each BSPPeer) are not deleted. There's no way to clean them up before
rebooting the server.
3. The graph examples are not working.
4. There are too many status reports between Groom and Master.
5. And there are many code issues that can be improved.

Issues 1 and 2 are already reported (see HAMA-387, HAMA-407). Work on some
of issues 3, 4, and 5 has already been started by ChiaHung Lin.

Should all of these issues be fixed in HAMA-413, or should we just
commit HAMA-413?

Thanks.
-- 
Best Regards, Edward J. Yoon
@eddieyoon


--
ChiaHung Lin
Department of Information Management
National University of Kaohsiung
Taiwan