Posted to dev@hive.apache.org by "Puneet Gupta (JIRA)" <ji...@apache.org> on 2014/02/15 09:53:19 UTC

[jira] [Commented] (HIVE-5922) In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled

    [ https://issues.apache.org/jira/browse/HIVE-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902353#comment-13902353 ] 

Puneet Gupta commented on HIVE-5922:
------------------------------------

I got a similar Exception (on seeking to row 9,103,258):

java.io.IOException: Seek outside of data in compressed stream Stream for column 65 kind DATA position: 1572882 length: 2116178 range: 1 offset: 1048588 limit: 1048588 range 0 = 0 to 0;  range 1 = 524294 to 1048588;  range 2 = 1835029 to 262147 uncompressed: 1048588 to 1048588 to 1572882
	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:277)
	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:153)
	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:197)
	at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readPatchedBaseValues(RunLengthIntegerReaderV2.java:161)
	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:54)
	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.skip(RunLengthIntegerReaderV2.java:318)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.skipRows(RecordReaderImpl.java:427)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.skipRows(RecordReaderImpl.java:1181)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:2183)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.seekToRow(RecordReaderImpl.java:2284)

Some observations
----
1. I have used Snappy for compression 

2. There are 75 columns in the file (mostly numbers: int, long, byte, short, plus a few strings). The exception always happens for column 65, which is an int. If I remove this column from the include-column list, the seek works fine.

3. This issue happens only when I seek to a row using RecordReader.seekToRow(long). In this flow the RecordReader is created using Reader.rows(long, long, boolean[], SearchArgument, String[]). The SearchArgument uses an "IN" construct with 200 long values, which are actually the row numbers I want to retrieve (SearchArgument.FACTORY.newBuilder().startOr().in(colName, 200 long values).end().build()). The exception happens on a seek to row 9103258 (the file has about 13 million rows). I tried a SearchArgument with just one IN value of 9103258... BINGO... got the same exception. The problem can be reproduced for any row seek between 9103258 and 9103279; rows after that seem to work fine. A minimal sketch of this failing flow is below.
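
For reference, a minimal repro sketch of the failing flow, against the Hive 0.13-era ORC API named above. The file path, the column name "rowNumCol", and the include/column-name arrays are placeholders I made up, not values from this setup:

{code}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;

public class OrcSeekRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("file.orc");            // placeholder path
    FileSystem fs = path.getFileSystem(conf);
    Reader reader = OrcFile.createReader(fs, path);
    long fileLen = fs.getFileStatus(path).getLen();

    // A single IN value is enough to reproduce the failure.
    SearchArgument sarg = SearchArgument.FACTORY.newBuilder()
        .startOr()
        .in("rowNumCol", 9103258L)               // hypothetical column name
        .end()
        .build();

    boolean[] include = new boolean[76];         // root struct + 75 columns
    Arrays.fill(include, true);                  // keep column 65 included
    RecordReader rows = reader.rows(0L, fileLen, include, sarg,
        new String[] { "rowNumCol" });           // placeholder name list

    rows.seekToRow(9103258L);  // throws "Seek outside of data in compressed stream"
    Object row = rows.next(null);
    rows.close();
  }
}
{code}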

4. I see no exceptions if the RecordReader is created using Reader.rows(null) and the entire file is iterated using RecordReader.hasNext() and RecordReader.next(), as in the sketch below.
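
The corresponding full-scan flow looks roughly like this (reusing reader from the sketch above):

{code}
// Plain full scan: no SearchArgument, no seeks -- no exception.
RecordReader rows = reader.rows(null);   // null include list = read all columns
Object row = null;
while (rows.hasNext()) {
  row = rows.next(row);
}
rows.close();
{code}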

5. I see no exceptions if the RecordReader is created using Reader.rows(long, long, boolean[], SearchArgument, String[]) with a null SearchArgument. The required data (about 200 rows) is then retrieved using RecordReader.seekToRow(long) and RecordReader.next(); see the sketch below.
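
And the workaround flow: the same five-argument rows(...) call, but with a null SearchArgument (and, I assume, a null column-name list, since there is no predicate to map). Reuses reader, fileLen, and include from the first sketch:

{code}
// Explicit seeks work once no predicate is pushed down.
RecordReader rows = reader.rows(0L, fileLen, include, null, null);
Object row = null;
for (long r : new long[] { 9103258L, 9103259L }) {  // stand-ins for the ~200 row numbers
  rows.seekToRow(r);
  row = rows.next(row);
}
rows.close();
{code}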

6. The obvious workaround is not to use predicate pushdown. In my case, since I know the row numbers to seek to, the performance hit is not very drastic:
	Read/seek to 167 rows in 3609 ms: existing usage with predicate pushdown in ORC
	Read/seek to 167 rows in 4626 ms: workaround without predicate/SearchArgument pushdown
	   >>>> Difference of 1017 ms = approx 6 ms per row slower (around 80% of the values are fetched from different strides)


> In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5922
>                 URL: https://issues.apache.org/jira/browse/HIVE-5922
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>            Reporter: Yin Huai
>
> Two stack traces ...
> {code}
> java.io.IOException: IO error in map input file hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/000004_0
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.io.IOException: java.io.IOException: Seek outside of data in compressed stream Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 uncompressed: 262144 to 262144 to 21496054
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
> 	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
> 	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
> 	... 9 more
> Caused by: java.io.IOException: Seek outside of data in compressed stream Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 uncompressed: 262144 to 262144 to 21496054
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:328)
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:161)
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
> 	at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
> 	... 13 more
> {code}
> {code}
> java.io.IOException: IO error in map input file hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/000095_0
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.io.IOException: java.lang.IllegalStateException: Can't read header at compressed stream Stream for column 9 kind DATA position: 20447466 length: 20958101 range: 6 offset: 1835029 limit: 1835029 range 0 = 0 to 524294;  range 1 = 1835029 to 2097176;  range 2 = 5242940 to 1835029;  range 3 = 8650851 to 1835029;  range 4 = 11796615 to 2097176;  range 5 = 15204526 to 2097176;  range 6 = 18612437 to 1835029 uncompressed: 262144 to 262144
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> 	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
> 	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
> 	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
> 	... 9 more
> Caused by: java.lang.IllegalStateException: Can't read header at compressed stream Stream for column 9 kind DATA position: 20447466 length: 20958101 range: 6 offset: 1835029 limit: 1835029 range 0 = 0 to 524294;  range 1 = 1835029 to 2097176;  range 2 = 5242940 to 1835029;  range 3 = 8650851 to 1835029;  range 4 = 11796615 to 2097176;  range 5 = 15204526 to 2097176;  range 6 = 18612437 to 1835029 uncompressed: 262144 to 262144
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:195)
> 	at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
> 	at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
> 	at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
> 	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
> 	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
> 	... 13 more
> {code}


