You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by "Parker, Matthew - IS" <Ma...@exelisinc.com> on 2013/03/06 15:33:50 UTC
Running Terasort on Hadoop and Local File System

I'm trying to simulate running Hadoop on Lustre by configuring it to use the local file system using a single cloudera VM (cdh3u4).

I can generate the data just fine, but when running the sorting portion of the program, I get an error about not being able to find the _partition.lst file. It exists in the generated data directory.

Perusing the Terasort code, I see in the main method that has a Path reference to partition.lst is created with the parent directory.

  public int run(String[] args) throws Exception {
       LOG.info("starting");
      JobConf job = (JobConf) getConf();
>>  Path inputDir = new Path(args[0]);
>>  inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
>>  Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
      URI partitionUri = new URI(partitionFile.toString() +
                               "#" + TeraInputFormat.PARTITION_FILENAME);
      TeraInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.setJobName("TeraSort");
      job.setJarByClass(TeraSort.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      job.setInputFormat(TeraInputFormat.class);
      job.setOutputFormat(TeraOutputFormat.class);
      job.setPartitionerClass(TotalOrderPartitioner.class);
      TeraInputFormat.writePartitionFile(job, partitionFile);
      DistributedCache.addCacheFile(partitionUri, job);
      DistributedCache.createSymlink(job);
      job.setInt("dfs.replication", 1);
      TeraOutputFormat.setFinalSync(job, true);
      JobClient.runJob(job);
      LOG.info("done");
      return 0;
  }

But in the configure method, the Path isn't created with the parent directory reference.

    public void configure(JobConf job) {

      try {
        FileSystem fs = FileSystem.getLocal(job);
>>    Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }

    }

I modified the code as follows, and now sorting portion of the Terasort test works using the general file system. I think the above code is a bug.

    public void configure(JobConf job) {

      try {
        FileSystem fs = FileSystem.getLocal(job);

  >>  Path[] inputPaths = TeraInputFormat.getInputPaths(job);
  >>  Path partFile = new Path(inputPaths[0], TeraInputFormat.PARTITION_FILENAME);

        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }

    }


________________________________

This e-mail and any files transmitted with it may be proprietary and are intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the sender. Please note that any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of Exelis Inc. The recipient should check this e-mail and any attachments for the presence of viruses. Exelis Inc. accepts no liability for any damage caused by any virus transmitted by this e-mail.