Posted to mapreduce-user@hadoop.apache.org by "Johannes.Lichtenberger" <Jo...@uni-konstanz.de> on 2010/10/21 00:09:53 UTC

Splitsize...

Hello,

I'm currently not sure what happens to records when the block size is
reached. Let's assume a block size of 128 MB and an XMLRecordReader
implementation that splits on a specific element.

I assume what happens is that the records lying entirely below the
128 MB boundary are all processed by one Mapper, and the record that
would push past 128 MB is processed by the next Mapper!?
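As I understand the usual contract (the one TextInputFormat's LineRecordReader follows), each reader handles every record that *starts* inside its split, reading past the split end if a record straddles the boundary, while the next reader skips the partial record at its start. A minimal self-contained sketch of that assignment rule — no Hadoop dependencies, and the offsets and split size are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAssignment {

    // Assign each record (identified by its start offset) to the split it
    // begins in. This mirrors the LineRecordReader convention: a record
    // straddling a split boundary is read entirely by the reader of the
    // split in which it starts.
    static List<List<Long>> assign(final long[] recordStarts,
            final long splitSize, final long fileLen) {
        final int numSplits = (int) ((fileLen + splitSize - 1) / splitSize);
        final List<List<Long>> perSplit = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            perSplit.add(new ArrayList<>());
        }
        for (final long start : recordStarts) {
            perSplit.get((int) (start / splitSize)).add(start);
        }
        return perSplit;
    }

    public static void main(final String[] args) {
        // Hypothetical records starting at these byte offsets in a
        // 300-byte file, with a split size of 128 bytes (a stand-in
        // for 128 MB).
        final long[] starts = {0, 100, 120, 200, 260};
        System.out.println(assign(starts, 128, 300));
        // The record starting at offset 120 belongs to split 0 even if
        // its body extends past byte 128.
    }
}
```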

I'm using FileSplit, which means the file isn't split after 128 MB of
data, and perhaps a fixed number of records goes to every Mapper (maybe
fewer to the "last" Mapper)!?

Consequently, I don't know whether my RecordReader really needs the
start and end of a file split:

    @Override
    public void initialize(final InputSplit paramGenericSplit,
            final TaskAttemptContext paramContext) throws IOException {
        final FileSplit split = (FileSplit) paramGenericSplit;
        final Configuration conf = paramContext.getConfiguration();

        mStart = split.getStart();
        mEnd = mStart + split.getLength();
        mValue = new Text();
        mKey = new DateWritable();
        mWriter = new StringWriter();
        try {
            mEventWriter =
                XMLOutputFactory.newInstance().createXMLEventWriter(mWriter);
        } catch (final XMLStreamException e) {
            LOGWRAPPER.error(e.getMessage(), e);
        } catch (final FactoryConfigurationError e) {
            LOGWRAPPER.error(e.getMessage(), e);
        }

        final Path file = split.getPath();

        // Open the file and seek to the start of the split.
        final FileSystem fileSys = file.getFileSystem(conf);
        final FSDataInputStream fileIn = fileSys.open(file);
        fileIn.seek(mStart);

        final CompressionCodecFactory comprCodecs =
            new CompressionCodecFactory(conf);
        final CompressionCodec codec = comprCodecs.getCodec(file);

        InputStream input = fileIn;
        if (codec != null) {
            // Compressed input: the split length is in compressed bytes,
            // so read to end-of-stream instead of a byte offset.
            input = codec.createInputStream(fileIn);
            mEnd = Long.MAX_VALUE;
        }
        input = new BufferedInputStream(input);
        ...
    }

For example, is the seek(mStart) really needed? I think I can safely
remove mStart and mEnd, because the StAX parser I'm using simply parses
from the start to the end of the file (START_DOCUMENT / END_DOCUMENT).
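
A side note on the mEnd = Long.MAX_VALUE branch in my snippet: for a compressed file the split length refers to *compressed* bytes on disk, which don't correspond to positions in the decompressed stream, so an end offset derived from the split length would cut records off. A self-contained illustration with plain java.util.zip (no Hadoop; the repetitive "XML" payload is made up):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressedLength {
    public static void main(final String[] args) throws Exception {
        // Highly repetitive "XML" compresses well, so the on-disk length
        // (what a FileSplit would report) is far smaller than the
        // decompressed record stream the parser actually sees.
        final byte[] plain = "<rec>x</rec>".repeat(10_000)
            .getBytes(StandardCharsets.UTF_8);
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        }
        final byte[] compressed = bos.toByteArray();
        System.out.println("plain=" + plain.length
            + " compressed=" + compressed.length);
        // An end offset computed from the compressed length would stop
        // far short of the decompressed data, hence reading to EOF
        // (mEnd = Long.MAX_VALUE) when a codec is present.
    }
}
```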

regards,
Johannes