Posted to dev@hama.apache.org by "Edward J. Yoon (JIRA)" <ji...@apache.org> on 2015/09/03 06:37:45 UTC

[jira] [Created] (HAMA-973) GraphJob and RandBench example works incorrectly when FT is enabled.

Edward J. Yoon created HAMA-973:
-----------------------------------

             Summary: GraphJob and RandBench example works incorrectly when FT is enabled.
                 Key: HAMA-973
                 URL: https://issues.apache.org/jira/browse/HAMA-973
             Project: Hama
          Issue Type: Bug
          Components: bsp core
    Affects Versions: 0.7.0
            Reporter: Edward J. Yoon
            Assignee: Edward J. Yoon
            Priority: Critical
             Fix For: 0.7.1


Today I tested the fault tolerance (FT) feature with RandBench. FT itself works fine, but I found a bug in the RandBench program.

{code}
[root@cluster-0 hama-0.7.0]# bin/hama jar hama-examples-0.7.0.jar bench 100 100 100
15/09/03 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/03 12:59:58 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/09/03 12:59:58 INFO bsp.BSPJobClient: Running job: job_201509031258_0002
15/09/03 13:00:01 INFO bsp.BSPJobClient: Current supersteps number: 0
15/09/03 13:00:22 INFO bsp.BSPJobClient: Current supersteps number: 2
15/09/03 13:00:26 INFO bsp.BSPJobClient: Current supersteps number: 5
15/09/03 13:00:29 INFO bsp.BSPJobClient: Current supersteps number: 11
15/09/03 13:00:32 INFO bsp.BSPJobClient: Current supersteps number: 16
15/09/03 13:00:35 INFO bsp.BSPJobClient: Current supersteps number: 21
15/09/03 13:00:38 INFO bsp.BSPJobClient: Current supersteps number: 28
15/09/03 13:00:41 INFO bsp.BSPJobClient: Current supersteps number: 35
15/09/03 13:00:44 INFO bsp.BSPJobClient: Current supersteps number: 42
15/09/03 13:00:47 INFO bsp.BSPJobClient: Current supersteps number: 49
15/09/03 13:00:50 INFO bsp.BSPJobClient: Current supersteps number: 56
15/09/03 13:02:05 INFO bsp.BSPJobClient: Current supersteps number: 0
15/09/03 13:02:08 INFO bsp.BSPJobClient: Current supersteps number: 56
15/09/03 13:02:11 INFO bsp.BSPJobClient: Current supersteps number: 0
15/09/03 13:02:20 INFO bsp.BSPJobClient: Current supersteps number: 57
15/09/03 13:02:23 INFO bsp.BSPJobClient: Current supersteps number: 61
15/09/03 13:02:26 INFO bsp.BSPJobClient: Current supersteps number: 67
15/09/03 13:02:29 INFO bsp.BSPJobClient: Current supersteps number: 72
15/09/03 13:02:32 INFO bsp.BSPJobClient: Current supersteps number: 77
15/09/03 13:02:35 INFO bsp.BSPJobClient: Current supersteps number: 84
15/09/03 13:02:38 INFO bsp.BSPJobClient: Current supersteps number: 91
15/09/03 13:02:41 INFO bsp.BSPJobClient: Current supersteps number: 97
15/09/03 13:02:44 INFO bsp.BSPJobClient: Current supersteps number: 106
15/09/03 13:02:47 INFO bsp.BSPJobClient: Current supersteps number: 113
15/09/03 13:02:50 INFO bsp.BSPJobClient: Current supersteps number: 125
15/09/03 13:02:53 INFO bsp.BSPJobClient: Current supersteps number: 134
15/09/03 13:02:56 INFO bsp.BSPJobClient: Current supersteps number: 144
15/09/03 13:02:59 INFO bsp.BSPJobClient: Current supersteps number: 152
15/09/03 13:03:02 INFO bsp.BSPJobClient: Current supersteps number: 156
15/09/03 13:03:05 INFO bsp.BSPJobClient: The total number of supersteps: 156
15/09/03 13:03:05 INFO bsp.BSPJobClient: Counters: 6
15/09/03 13:03:05 INFO bsp.BSPJobClient:   org.apache.hama.bsp.JobInProgress$JobCounter
15/09/03 13:03:05 INFO bsp.BSPJobClient:     SUPERSTEPS=156
15/09/03 13:03:05 INFO bsp.BSPJobClient:     LAUNCHED_TASKS=160
15/09/03 13:03:05 INFO bsp.BSPJobClient:   org.apache.hama.bsp.BSPPeerImpl$PeerCounter
15/09/03 13:03:05 INFO bsp.BSPJobClient:     SUPERSTEP_SUM=24960
15/09/03 13:03:05 INFO bsp.BSPJobClient:     TIME_IN_SYNC_MS=1943366
15/09/03 13:03:05 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_SENT=1600000
15/09/03 13:03:05 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_RECEIVED=1600000
Job Finished in 187.453 seconds
{code}

I ran the job with the maximum number of iterations set to 100. At superstep 56, I killed one task manually and confirmed that the failed task was recovered automatically. However, the total number of supersteps was 156, not 100.

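The reason is simple: the loop counter i always starts from 0, so the relaunched task runs another 100 iterations on top of the 56 supersteps already completed (56 + 100 = 156). To fix this issue, we have to initialize i to (int) peer.getSuperstepCount().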

{code}
    public void bsp(
        BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, BytesWritable> peer)
        throws IOException, SyncException, InterruptedException {
      byte[] dummyData = new byte[sizeOfMsg];
      String[] peers = peer.getAllPeerNames();

      for (int i = 0; i < nSupersteps; i++) {
{code}
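
As a rough illustration of the proposed fix, the following self-contained sketch simulates the restart scenario without depending on Hama. FakePeer is a hypothetical stand-in for BSPPeer whose counter is assumed to be restored after recovery; only the getSuperstepCount() name comes from the description above, and everything else is made up for the example.

```java
public class RandBenchResumeSketch {

  // Hypothetical stand-in for BSPPeer. It only models a superstep counter
  // that is assumed to be restored to its pre-failure value on relaunch.
  public static final class FakePeer {
    private long superstepCount;

    public FakePeer(long restoredCount) {
      this.superstepCount = restoredCount;
    }

    public long getSuperstepCount() {
      return superstepCount;
    }

    // Each sync() call models the end of one superstep.
    public void sync() {
      superstepCount++;
    }
  }

  // Loop shape from the report: i always starts from 0, so a relaunched
  // task runs nSupersteps more iterations regardless of prior progress.
  public static long runBuggy(FakePeer peer, int nSupersteps) {
    for (int i = 0; i < nSupersteps; i++) {
      peer.sync();
    }
    return peer.getSuperstepCount();
  }

  // Proposed loop shape: i resumes from the restored superstep count.
  public static long runFixed(FakePeer peer, int nSupersteps) {
    for (int i = (int) peer.getSuperstepCount(); i < nSupersteps; i++) {
      peer.sync();
    }
    return peer.getSuperstepCount();
  }

  public static void main(String[] args) {
    // A task relaunched at superstep 56 with a limit of 100 supersteps.
    System.out.println(runBuggy(new FakePeer(56), 100)); // prints 156
    System.out.println(runFixed(new FakePeer(56), 100)); // prints 100
  }
}
```

With the counter restored to 56, the original loop shape ends at 156 supersteps, matching the log above, while the resumed loop stops at 100.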

GraphJobRunner has a similar problem. When a task is relaunched, the setup() method is called again. The code below should be executed only during the initial phase.

{code}
    long startTime = System.currentTimeMillis();
    loadVertices(peer);
    LOG.info("Total time spent for loading vertices: "
        + (System.currentTimeMillis() - startTime) + " ms");

    startTime = System.currentTimeMillis();
    countGlobalVertexCount(peer);
    LOG.info("Total time spent for broadcasting global vertex count: "
        + (System.currentTimeMillis() - startTime) + " ms");

    startTime = System.currentTimeMillis();
    doInitialSuperstep(peer);
    LOG.info("Total time spent for initial superstep: "
        + (System.currentTimeMillis() - startTime) + " ms");
{code}
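
One possible way to guard the initial phase is sketched below. It assumes, without confirmation from the Hama code, that a superstep count of 0 identifies the first launch of a task; a real fix may need a different signal and may still have to restore vertex state from a checkpoint. The class and its setup(long) method are hypothetical, and the step names only mirror the calls shown above.

```java
import java.util.ArrayList;
import java.util.List;

public class SetupGuardSketch {

  // Returns the names of the initialization steps that would run.
  // Assumption: superstepCount == 0 means the task is launched for the
  // first time; any larger value means it was relaunched after a failure.
  public static List<String> setup(long superstepCount) {
    List<String> executed = new ArrayList<>();
    if (superstepCount == 0) {
      executed.add("loadVertices");
      executed.add("countGlobalVertexCount");
      executed.add("doInitialSuperstep");
    }
    return executed;
  }

  public static void main(String[] args) {
    System.out.println(setup(0));  // all three initial steps
    System.out.println(setup(57)); // empty list: relaunched task skips them
  }
}
```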


