Posted to user@hadoop.apache.org by Travis Chung <jt...@gmail.com> on 2016/08/02 11:53:43 UTC

FileSplit clarification

I wanted to get clarification on the start parameter of FileSplit. If I
understand correctly, it's the byte offset from the beginning of the file at
which the split begins.

/** Constructs a split with host information
   *
   * @param file the file name
   * @param start the position of the first byte in the file to process
   * @param length the number of bytes in the file to process
   * @param hosts the list of hosts containing the block, possibly null
   */
  public FileSplit(Path file, long start, long length, String[] hosts)

In the Hadoop RecordReader blog post
<https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/>,
the author creates a custom RecordReader and checks whether he needs to skip
the first line (on the assumption that it has already been processed by the
previous split).

Why would he need to skip the first line if getStart() already points to
the beginning of the current split?

In initialize() of CustomRecordReader:

// Split "S" is responsible for all records
// between the "start" and "end" positions
start = split.getStart();
end = start + split.getLength();

// Retrieve file containing Split "S"
final Path file = split.getPath();
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());

// If Split "S" starts at byte 0, first line will be processed
// If Split "S" does not start at byte 0, first line has been already
// processed by "S-1" and therefore needs to be silently ignored
boolean skipFirstLine = false;
if (start != 0) {
    skipFirstLine = true;
    // Set the file pointer at the "start - 1" position.
    // This is to make sure we won't miss any line.
    // It could happen if "start" is located on an EOL
    --start;
    fileIn.seek(start);
}
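
To see why the skip is needed: split boundaries are computed in bytes and
routinely fall in the middle of a line, so any line that straddles a boundary
would otherwise be seen by two readers. Below is a small plain-Java sketch of
the same rule the snippet applies, using an in-memory byte array in place of
the HDFS file (readSplit is a hypothetical helper for illustration, not part
of the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    /*
     * Mimics the skip-first-line rule on an in-memory "file":
     * if start != 0, back up one byte (the "--start" trick) and
     * discard everything up to the first EOL, then read whole
     * lines, possibly past "end", as long as each line *starts*
     * before "end".
     */
    public static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Discard the partial first line: the reader of the
            // previous split owns it and will read past its own end.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // first byte of the first line this reader owns
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // step over the EOL, possibly moving past "end"
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        // Cut the 20-byte file into two 10-byte splits; byte 10 falls
        // in the middle of "bravo", so split 1 starts mid-record.
        System.out.println(readSplit(data, 0, 10));  // [alpha, bravo]
        System.out.println(readSplit(data, 10, 10)); // [charlie]
    }
}
```

Together the two calls return each of the three lines exactly once, even
though the 10-byte boundary cuts "bravo" in half: split 0 reads past its end
to finish "bravo", and split 1 skips the half-line it starts in.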