Posted to mapreduce-issues@hadoop.apache.org by "Junping Du (JIRA)" <ji...@apache.org> on 2016/02/17 09:37:18 UTC
[jira] [Assigned] (MAPREDUCE-6635) Unsafe long to int conversion in UncompressedSplitLineReader and IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/MAPREDUCE-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junping Du reassigned MAPREDUCE-6635:
-------------------------------------
Assignee: Junping Du
> Unsafe long to int conversion in UncompressedSplitLineReader and IndexOutOfBoundsException
> ------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6635
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6635
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Junping Du
>
> LineRecordReader creates the unsplittable reader like so:
> {noformat}
> in = new UncompressedSplitLineReader(
>     fileIn, job, recordDelimiter, split.getLength());
> {noformat}
> Split length goes to
> {noformat}
> private long splitLength;
> {noformat}
> At some point when reading the first line, fillBuffer does this:
> {noformat}
> @Override
> protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
>     throws IOException {
>   int maxBytesToRead = buffer.length;
>   if (totalBytesRead < splitLength) {
>     maxBytesToRead = Math.min(maxBytesToRead,
>         (int)(splitLength - totalBytesRead));
> {noformat}
> The cast truncates the long difference, so when the remaining split length exceeds Integer.MAX_VALUE (i.e. for splits larger than 2 GB) maxBytesToRead becomes a negative number, and the subsequent DFS read fails its boundary check:
> {noformat}
> java.lang.IndexOutOfBoundsException
> at java.nio.Buffer.checkBounds(Buffer.java:559)
> at java.nio.ByteBuffer.get(ByteBuffer.java:668)
> at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
> at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:172)
> at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:744)
> at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:800)
> at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:860)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
> at java.io.DataInputStream.read(DataInputStream.java:149)
> at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
> at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
> at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
> {noformat}
> This has been reported at https://issues.streamsets.com/browse/SDC-2229. It also happens in Hive when very large text files are forced to be read in a single split (e.g. via the header-skipping feature, or via set mapred.min.split.size=9999999999999999).
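A minimal standalone sketch of the overflow described above, plus one possible fix. The helper method names and the 3 GB split length are illustrative assumptions, not Hadoop code; the fix shown (taking the min in long arithmetic before casting) is one plausible approach, not necessarily the committed patch.

```java
public class CastOverflowDemo {
    // Mirrors the pattern quoted from fillBuffer: the long difference is cast
    // to int *before* Math.min, so values above Integer.MAX_VALUE wrap negative.
    static int buggyMaxBytesToRead(int bufferLen, long splitLength,
                                   long totalBytesRead) {
        return Math.min(bufferLen, (int) (splitLength - totalBytesRead));
    }

    // Possible fix (an assumption, not the actual patch): take the min in long
    // arithmetic first; the result is bounded by bufferLen, so the final cast
    // to int is safe.
    static int safeMaxBytesToRead(int bufferLen, long splitLength,
                                  long totalBytesRead) {
        return (int) Math.min((long) bufferLen, splitLength - totalBytesRead);
    }

    public static void main(String[] args) {
        long splitLength = 3_000_000_000L; // a split larger than 2 GB
        // Buggy: (int) 3_000_000_000L wraps to -1294967296, and Math.min
        // then picks the negative value.
        System.out.println(buggyMaxBytesToRead(65536, splitLength, 0)); // negative
        // Fixed: min computed on longs, then safely narrowed.
        System.out.println(safeMaxBytesToRead(65536, splitLength, 0));  // 65536
    }
}
```

Passing a negative length into a read ultimately trips the bounds check in java.nio.Buffer, which is the IndexOutOfBoundsException in the stack trace above.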
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)