Posted to user@hive.apache.org by Keith Wright <kw...@nanigans.com> on 2013/05/28 15:02:13 UTC

IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume

Hi all,

   This is my first post to the hive mailing list and I was hoping to get some help with the exception I am getting below.  I am using CDH4.2 (Hive 0.10.0) to query Snappy-compressed SequenceFiles that are built by Flume (the relevant portion of the Flume conf is below as well).  Note that I'm using SequenceFile because it was needed for Impala integration.  Has anyone seen this error before?  A couple of additional points to help diagnose:

 1.  Queries seem to process some mappers without problems.  In fact, a simple select * from <table> limit 10 works fine.  However, if I make the limit high enough, the query eventually fails, presumably once it has to read a file that triggers this issue.
 2.  The same query runs in Impala without errors but appears to "skip" some data.  I can confirm via a custom map/reduce job that the missing data is present.
 3.  I am able to write a map/reduce job that reads through all of the same data without issue, and I have been unable to identify any data corruption (a rough sketch of this kind of check is included after this list).
 4.  This is a partitioned table, and queries that touch ANY of the partitions (there are hundreds) fail, so this does not appear to be a sporadic data-integrity problem (table definition below).
 5.  We are using '\001' as our field separator.  We also capture other data as Snappy-compressed SequenceFiles, but with '|' as the delimiter, and we have no issues querying that data, although it comes from a different Flume source.
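
For reference, a minimal sketch of the kind of standalone check mentioned in point 3 (not the actual job; it takes the key and value classes from the file header and assumes each value is a Text line of '\001'-separated fields, with the file path and expected field count passed in as arguments):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Iterates every record of one Flume-written SequenceFile and reports any
// value that does not split into the expected number of '\001' fields.
public class SeqFileCheck {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);                  // e.g. one file under /events/organic/...
        int expectedFields = Integer.parseInt(args[1]); // 14 non-partition columns in ORGANIC_EVENTS
        SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        try {
            // Instantiate key/value using the classes recorded in the file header
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            long rows = 0, bad = 0;
            while (reader.next(key, value)) {
                rows++;
                // -1 keeps trailing empty fields so the count is honest
                if (value.toString().split("\u0001", -1).length != expectedFields) {
                    bad++;
                    System.err.println("row " + rows + " has an unexpected field count");
                }
            }
            System.out.println(path + ": " + rows + " rows, " + bad + " bad");
        } finally {
            reader.close();
        }
    }
}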

My next step for debugging is to disable Snappy compression and see if I can query the data; if that doesn't help, I'll switch from SequenceFile to plain text.
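
Concretely, the Flume sink changes I have in mind are roughly the following (a sketch only, modeled on the sink config further down; DataStream is the Flume HDFS sink's plain-text file type):

# step 1: keep SequenceFile but drop compression by removing the codec setting
agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
# agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy

# step 2 (only if step 1 still fails): write plain text instead of SequenceFiles
agent.sinks.exhaustHDFSSink3.hdfs.fileType = DataStream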

I appreciate the help!!!

CREATE EXTERNAL TABLE ORGANIC_EVENTS (
event_id BIGINT,
app_id INT,
user_id BIGINT,
type STRING,
name STRING,
value STRING,
extra STRING,
ip_address STRING,
user_agent STRING,
referrer STRING,
event_time BIGINT,
install_flag TINYINT,
first_for_user TINYINT,
cookie STRING)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE
LOCATION '/events/organic';

agent.sinks.exhaustHDFSSink3.type = HDFS
agent.sinks.exhaustHDFSSink3.channel = exhaustFileChannel
agent.sinks.exhaustHDFSSink3.hdfs.path = hdfs://lxscdh001.nanigans.com:8020%{path}
agent.sinks.exhaustHDFSSink3.hdfs.filePrefix = 3.%{hostname}
agent.sinks.exhaustHDFSSink3.hdfs.rollInterval = 0
agent.sinks.exhaustHDFSSink3.hdfs.idleTimeout = 600
agent.sinks.exhaustHDFSSink3.hdfs.rollSize = 0
agent.sinks.exhaustHDFSSink3.hdfs.rollCount = 0
agent.sinks.exhaustHDFSSink3.hdfs.batchSize = 5000
agent.sinks.exhaustHDFSSink3.hdfs.txnEventMax = 5000
agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
agent.sinks.exhaustHDFSSink3.hdfs.maxOpenFiles = 100
agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
agent.sinks.exhaustHDFSSink3.hdfs.writeFormat = Text


2013-05-28 12:29:00,919 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:330)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:246)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:216)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:201)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.io.IOException: java.lang.IndexOutOfBoundsException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:328)
        ... 11 more
Caused by: java.lang.IndexOutOfBoundsException
        at java.io.DataInputStream.readFully(DataInputStream.java:175)
        at org.apache.hadoop.io.Text.readFields(Text.java:284)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2180)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2164)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
        ... 15 more

Re: IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume

Posted by Shreepadma Venugopalan <sh...@cloudera.com>.
Hi Keith,

Were you able to resolve this? Or, is this still an issue?

Thanks.
Shreepadma

