Posted to mapreduce-user@hadoop.apache.org by "Berry, Matt" <mw...@amazon.com> on 2012/07/20 02:28:37 UTC

Lines missing from output files (0.20.205.0)

I have a slightly modified TextOutputFormat that essentially writes each key into its own file. It operates on the premise that my reducer is an identity function and emits each record one by one, in the order it arrives from the collection. Because the records come out of the reducer in order, I can keep a single output file open and close it when a new key appears. The reason I am doing it this way instead of using MultipleOutputs is that I am locked into Hadoop 0.20.205.0.

The problem I am having is that I randomly get IOExceptions from opening a file that already exists. There are two ways I can imagine this happening: (1) Reducer 1 emits a record for key A and then Reducer 2 also emits a record for key A; I'm certain this is not the case, as the keys should all group together. (2) The records are emitted out of order from a single reducer (AAAA BBBB A), in which case the reducer would try to open A again. (A guard that could help distinguish these follows the sample code below.)

What is perplexing me is that, in addition to the per-key output files, each output format opens a log file. I am seeing an exception propagate out of the reducer, but no corresponding error appears in my log file. Sample code follows to clarify.

class ModifiedTextOutputFormat {

  private FileSystem fs;                 // initialized elsewhere
  private FSDataOutputStream logFile;    // opened by createLogFile()
  private FSDataOutputStream outputFile; // file for the current key
  private String current;                // key the open file belongs to

  public ModifiedTextOutputFormat() {
    createLogFile();
  }

  protected void createOutputFile(String name) throws IOException {
    try {
      outputFile = fs.create(new Path(name));
    } catch (Throwable t) {
      logFile.writeBytes("Information about the error");  // Here I log the error (although it is missing later)
      closeLogFile(); // Here I close that file to be certain the last line is flushed
      throw new IOException("Information about the error", t); // Here I throw an exception, which appears on stderr
    }
  }

  public void write(Key k, Value v) throws IOException {
    if (!k.toString().equals(current)) {
      if (outputFile != null) {
        outputFile.close();              // done with the previous key
      }
      createOutputFile(k.toString());
      current = k.toString();
    }
    outputFile.writeBytes(v.toString());
  }
}
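
One way I could distinguish cases (1) and (2) would be a guard inside createOutputFile, along these lines (a sketch; seenKeys is purely illustrative):

  // Remember every key this writer has created a file for. Each task has
  // its own instance, so this guard can only fire for case (2): the same
  // writer seeing a key again after moving on. If the "file exists" error
  // still occurs without the guard firing, that points at case (1).
  private final java.util.Set<String> seenKeys = new java.util.HashSet<String>();

  // At the top of createOutputFile(name):
  if (!seenKeys.add(name)) {
    throw new IOException("Key re-opened by this writer: " + name);
  }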

RE: Lines missing from output files (0.20.205.0)

Posted by "Berry, Matt" <mw...@amazon.com>.
I was operating under the assumption that there is a 1:1 ratio between Reducers and OutputFormats, just as there is a 1:1 correspondence between partitions and Reducers. However, I found this isn't the case. I had each output format create a file named in this manner: "OutputFormat_(Partition number of first record written)_(Random number chosen at creation)" (a sketch of that instrumentation follows the list). I had 10 reducers for my job and found 13 files:

OutputFormat_0_16
OutputFormat_1_27
OutputFormat_1_85
OutputFormat_2_45
OutputFormat_3_72
OutputFormat_4_18
OutputFormat_5_75
OutputFormat_6_57
OutputFormat_6_91
OutputFormat_7_27
OutputFormat_7_46
OutputFormat_8_28
OutputFormat_9_21
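
In sketch form, that tagging looks like this (firstPartition and debugDir are stand-ins for details of my setup):

  // Each OutputFormat instance drops an empty marker file when it writes
  // its first record: that record's partition plus a random suffix.
  String tag = "OutputFormat_" + firstPartition + "_" + new java.util.Random().nextInt(100);
  fs.create(new Path(debugDir, tag)).close();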

My job requires that all records with a given key be processed by the same OutputFormat object. Is there a way to control the number of OutputFormats, the way I can control the number of Reducers and the distribution of records across partitions?
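
For reference, this is the level of control I already have on the partition side; a minimal sketch with the new API (the Text key/value types are just for illustration):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Every record with the same key hashes to the same partition, and
  // therefore goes to the same reducer.
  public class KeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Job setup:
  //   job.setNumReduceTasks(10);
  //   job.setPartitionerClass(KeyPartitioner.class);

What I am after is the analogous hook that pins each key to a single OutputFormat instance.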

Sincerely,
Matthew Berry
