Posted to dev@hama.apache.org by "Edward J. Yoon (JIRA)" <ji...@apache.org> on 2015/09/03 06:37:45 UTC
[jira] [Created] (HAMA-973) GraphJob and RandBench example works
incorrectly when FT is enabled.
Edward J. Yoon created HAMA-973:
-----------------------------------
Summary: GraphJob and RandBench example works incorrectly when FT is enabled.
Key: HAMA-973
URL: https://issues.apache.org/jira/browse/HAMA-973
Project: Hama
Issue Type: Bug
Components: bsp core
Affects Versions: 0.7.0
Reporter: Edward J. Yoon
Assignee: Edward J. Yoon
Priority: Critical
Fix For: 0.7.1
Today I tested the fault tolerance (FT) feature with RandBench. FT itself works fine, but I found a bug in the RandBench program.
{code}
[root@cluster-0 hama-0.7.0]# bin/hama jar hama-examples-0.7.0.jar bench 100 100 100
15/09/03 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/03 12:59:58 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/09/03 12:59:58 INFO bsp.BSPJobClient: Running job: job_201509031258_0002
15/09/03 13:00:01 INFO bsp.BSPJobClient: Current supersteps number: 0
15/09/03 13:00:22 INFO bsp.BSPJobClient: Current supersteps number: 2
15/09/03 13:00:26 INFO bsp.BSPJobClient: Current supersteps number: 5
15/09/03 13:00:29 INFO bsp.BSPJobClient: Current supersteps number: 11
15/09/03 13:00:32 INFO bsp.BSPJobClient: Current supersteps number: 16
15/09/03 13:00:35 INFO bsp.BSPJobClient: Current supersteps number: 21
15/09/03 13:00:38 INFO bsp.BSPJobClient: Current supersteps number: 28
15/09/03 13:00:41 INFO bsp.BSPJobClient: Current supersteps number: 35
15/09/03 13:00:44 INFO bsp.BSPJobClient: Current supersteps number: 42
15/09/03 13:00:47 INFO bsp.BSPJobClient: Current supersteps number: 49
15/09/03 13:00:50 INFO bsp.BSPJobClient: Current supersteps number: 56
15/09/03 13:02:05 INFO bsp.BSPJobClient: Current supersteps number: 0
15/09/03 13:02:08 INFO bsp.BSPJobClient: Current supersteps number: 56
15/09/03 13:02:11 INFO bsp.BSPJobClient: Current supersteps number: 0
15/09/03 13:02:20 INFO bsp.BSPJobClient: Current supersteps number: 57
15/09/03 13:02:23 INFO bsp.BSPJobClient: Current supersteps number: 61
15/09/03 13:02:26 INFO bsp.BSPJobClient: Current supersteps number: 67
15/09/03 13:02:29 INFO bsp.BSPJobClient: Current supersteps number: 72
15/09/03 13:02:32 INFO bsp.BSPJobClient: Current supersteps number: 77
15/09/03 13:02:35 INFO bsp.BSPJobClient: Current supersteps number: 84
15/09/03 13:02:38 INFO bsp.BSPJobClient: Current supersteps number: 91
15/09/03 13:02:41 INFO bsp.BSPJobClient: Current supersteps number: 97
15/09/03 13:02:44 INFO bsp.BSPJobClient: Current supersteps number: 106
15/09/03 13:02:47 INFO bsp.BSPJobClient: Current supersteps number: 113
15/09/03 13:02:50 INFO bsp.BSPJobClient: Current supersteps number: 125
15/09/03 13:02:53 INFO bsp.BSPJobClient: Current supersteps number: 134
15/09/03 13:02:56 INFO bsp.BSPJobClient: Current supersteps number: 144
15/09/03 13:02:59 INFO bsp.BSPJobClient: Current supersteps number: 152
15/09/03 13:03:02 INFO bsp.BSPJobClient: Current supersteps number: 156
15/09/03 13:03:05 INFO bsp.BSPJobClient: The total number of supersteps: 156
15/09/03 13:03:05 INFO bsp.BSPJobClient: Counters: 6
15/09/03 13:03:05 INFO bsp.BSPJobClient: org.apache.hama.bsp.JobInProgress$JobCounter
15/09/03 13:03:05 INFO bsp.BSPJobClient: SUPERSTEPS=156
15/09/03 13:03:05 INFO bsp.BSPJobClient: LAUNCHED_TASKS=160
15/09/03 13:03:05 INFO bsp.BSPJobClient: org.apache.hama.bsp.BSPPeerImpl$PeerCounter
15/09/03 13:03:05 INFO bsp.BSPJobClient: SUPERSTEP_SUM=24960
15/09/03 13:03:05 INFO bsp.BSPJobClient: TIME_IN_SYNC_MS=1943366
15/09/03 13:03:05 INFO bsp.BSPJobClient: TOTAL_MESSAGES_SENT=1600000
15/09/03 13:03:05 INFO bsp.BSPJobClient: TOTAL_MESSAGES_RECEIVED=1600000
Job Finished in 187.453 seconds
{code}
I ran the job with the maximum number of iterations set to 100. At superstep 56, I killed one task manually and confirmed that the failed task was recovered automatically. However, the total number of supersteps was 156, not 100.
The reason is simple: the loop counter i always starts from 0, so the recovered job runs all 100 iterations again on top of the 56 supersteps already completed (56 + 100 = 156). To fix this, i has to be initialized to (int) peer.getSuperstepCount().
{code}
public void bsp(
    BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, BytesWritable> peer)
    throws IOException, SyncException, InterruptedException {
  byte[] dummyData = new byte[sizeOfMsg];
  String[] peers = peer.getAllPeerNames();
  for (int i = 0; i < nSupersteps; i++) {
{code}
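The effect of the loop start value can be reproduced without a cluster. The following is only a minimal sketch: FakePeer is a hypothetical stand-in that tracks nothing but the superstep count, not Hama's BSPPeer, and the two run methods are illustrative names. It assumes a recovered task sees a superstep count of 56, as in the log above.

```java
public class ResumeLoopSketch {

  /** Hypothetical stand-in for a BSP peer; it only tracks the superstep count. */
  static final class FakePeer {
    private long superstepCount;

    FakePeer(long restoredSuperstepCount) {
      this.superstepCount = restoredSuperstepCount;
    }

    long getSuperstepCount() {
      return superstepCount;
    }

    /** Each barrier synchronization advances the superstep count by one. */
    void sync() {
      superstepCount++;
    }
  }

  /** Current behavior: the counter restarts at 0 even after recovery. */
  static long runFromZero(FakePeer peer, int nSupersteps) {
    for (int i = 0; i < nSupersteps; i++) {
      peer.sync();
    }
    return peer.getSuperstepCount();
  }

  /** Proposed behavior: the counter resumes from the peer's superstep count. */
  static long runFromSuperstepCount(FakePeer peer, int nSupersteps) {
    for (int i = (int) peer.getSuperstepCount(); i < nSupersteps; i++) {
      peer.sync();
    }
    return peer.getSuperstepCount();
  }

  public static void main(String[] args) {
    // A task recovered at superstep 56 with nSupersteps = 100.
    System.out.println(runFromZero(new FakePeer(56), 100));           // 156
    System.out.println(runFromSuperstepCount(new FakePeer(56), 100)); // 100
  }
}
```

With the counter starting at 0, the sketch ends at 156 supersteps, matching the job output; starting from the superstep count, it ends at 100.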
GraphJobRunner also has a similar problem. When a task is relaunched, the setup() method is called again. The code below should run only in the initial phase.
{code}
long startTime = System.currentTimeMillis();
loadVertices(peer);
LOG.info("Total time spent for loading vertices: "
    + (System.currentTimeMillis() - startTime) + " ms");

startTime = System.currentTimeMillis();
countGlobalVertexCount(peer);
LOG.info("Total time spent for broadcasting global vertex count: "
    + (System.currentTimeMillis() - startTime) + " ms");

startTime = System.currentTimeMillis();
doInitialSuperstep(peer);
LOG.info("Total time spent for initial superstep: "
    + (System.currentTimeMillis() - startTime) + " ms");
{code}
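One possible way to guard these calls is sketched below. This is an assumption, not the committed fix: it treats a superstep count of 0 as the marker for the initial phase, FakePeer is a hypothetical stand-in for the real peer, and the three steps are recorded as strings instead of being executed. Whether all three steps can really be skipped on relaunch depends on what the checkpoint restores, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.List;

public class GuardedSetupSketch {

  /** Hypothetical stand-in for a BSP peer; it only exposes the superstep count. */
  static final class FakePeer {
    private final long superstepCount;

    FakePeer(long superstepCount) {
      this.superstepCount = superstepCount;
    }

    long getSuperstepCount() {
      return superstepCount;
    }
  }

  /**
   * Returns the names of the initialization steps that setup() would run.
   * Assumption: superstep count 0 means a fresh start, anything else a relaunch.
   */
  static List<String> setup(FakePeer peer) {
    List<String> executed = new ArrayList<>();
    if (peer.getSuperstepCount() == 0) {
      executed.add("loadVertices");
      executed.add("countGlobalVertexCount");
      executed.add("doInitialSuperstep");
    }
    // On relaunch nothing is re-run here; state recovery is out of scope.
    return executed;
  }

  public static void main(String[] args) {
    System.out.println(setup(new FakePeer(0)));  // all three steps
    System.out.println(setup(new FakePeer(56))); // no steps
  }
}
```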