Posted to user@avro.apache.org by Yong Zhang <ja...@hotmail.com> on 2016/01/21 16:53:38 UTC

Strange AVRO 1.7.4 exception

Hi, AVRO gurus:
I am facing a strange issue with Avro right now, and hope someone here can at least give me some hints or a direction for debugging it.
Our production cluster is CentOS 6.5 (2.6.32-358.14.1.el6.x86_64), running IBM BigInsights V3.0.0.2. In Apache terms, that is Hadoop 2.2.0 with MRv1 (no YARN), and it comes with Avro 1.7.4, running on an IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 20140515_199835, JIT enabled, AOT enabled). Not sure if the JDK matters, but it is NOT the Oracle JVM.
We have an ETL implemented as a chain of MR jobs. One of the MR jobs merges two sets of Avro data: Dataset1 is in HDFS location A, Dataset2 is in HDFS location B, and both contain Avro records bound to the same Avro schema. Each record contains a unique id field and a timestamp field. The MR job merges the records by ID, keeping the record with the later timestamp over the one with the earlier timestamp, and emits the final Avro record. Very straightforward.
Now we face a problem where one reducer keeps failing with the following stack trace on the JobTracker:
java.lang.IndexOutOfBoundsException
	at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
	at java.io.DataInputStream.read(DataInputStream.java:160)
	at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
	at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
	at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
	at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
	at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
	at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
	at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
	at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
	at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
	at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
	at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
	at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(AccessController.java:366)
	at javax.security.auth.Subject.doAs(Subject.java:572)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here are my Mapper and Reducer method signatures:
Mapper:
    public void map(AvroKey<SpecificRecord> key, NullWritable value, Context context) throws IOException, InterruptedException
Reducer:
    protected void reduce(CustomPartitionKeyClass key, Iterable<AvroValue<SpecificRecord>> values, Context context) throws IOException, InterruptedException
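For context, a rough sketch of the kind of reduce logic described above (not our exact code; the generated Lists class, its getUpdatedAt()/newBuilder() methods, and the AvroKey<Lists>/NullWritable output are assumptions for illustration):

    import java.io.IOException;

    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapred.AvroValue;
    import org.apache.avro.specific.SpecificRecord;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: keep, for each key, the record with the latest updated_at.
    public class MergeReducer
        extends Reducer<CustomPartitionKeyClass, AvroValue<SpecificRecord>, AvroKey<Lists>, NullWritable> {

      @Override
      protected void reduce(CustomPartitionKeyClass key,
                            Iterable<AvroValue<SpecificRecord>> values,
                            Context context) throws IOException, InterruptedException {
        Lists latest = null;
        for (AvroValue<SpecificRecord> value : values) {
          Lists record = (Lists) value.datum();
          Long ts = record.getUpdatedAt();
          if (latest == null
              || (ts != null && (latest.getUpdatedAt() == null || ts > latest.getUpdatedAt()))) {
            // Copy the datum: Avro/Hadoop may reuse the deserialized object on the next iteration.
            latest = Lists.newBuilder(record).build();
          }
        }
        if (latest != null) {
          context.write(new AvroKey<Lists>(latest), NullWritable.get());
        }
      }
    }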
What bothers me are the following facts:
1) All the mappers finish without error.
2) Most of the reducers finish without error, but one reducer keeps failing with the above error.
3) It looks like it is caused by the data? But keep in mind that all the Avro records passed through the mapper side, yet failed in one reducer.
4) From the stack trace, it looks like our reducer code was NOT invoked yet; the task failed before that. So my guess is that all the Avro records pass through the mapper side, but Avro complains about the intermediate output generated by one of the mappers? In my understanding, that will be a sequence file generated by Hadoop, with the value part being the Avro bytes. Does the above error mean that Avro cannot deserialize the value part from the sequence file?
5) Our ETL ran fine for more than one year, but suddenly started hitting this error one day, and has kept hitting it ever since.
6) If it helps, here is the schema for the Avro record:
{
    "namespace" : "company name",
    "type" : "record",
    "name" : "Lists",
    "fields" : [
        {"name" : "account_id", "type" : "long"},
        {"name" : "list_id", "type" : "string"},
        {"name" : "sequence_id", "type" : ["int", "null"]} ,
        {"name" : "name", "type" : ["string", "null"]},
        {"name" : "state", "type" : ["string", "null"]},
        {"name" : "description", "type" : ["string", "null"]},
        {"name" : "dynamic_filtered_list", "type" : ["int", "null"]},
        {"name" : "filter_criteria", "type" : ["string", "null"]},
        {"name" : "created_at", "type" : ["long", "null"]},
        {"name" : "updated_at", "type" : ["long", "null"]},
        {"name" : "deleted_at", "type" : ["long", "null"]},
        {"name" : "favorite", "type" : ["int", "null"]},
        {"name" : "delta", "type" : ["boolean", "null"]},
        {
            "name" : "list_memberships", "type" : {
                "type" : "array", "items" : {
                    "name" : "ListMembership", "type" : "record",
                    "fields" : [
                        {"name" : "channel_id", "type" : "string"},
                        {"name" : "created_at", "type" : ["long", "null"]},
                        {"name" : "created_source", "type" : ["string", "null"]},
                        {"name" : "deleted_at", "type" : ["long", "null"]},
                        {"name" : "sequence_id", "type" : ["long", "null"]}
                    ]
                }
            }
        }
    ]
}
What I am looking for is someone who can help me understand what could cause this problem on the reducer side. From all the facts, it points to the data: the whole thing worked for more than a year, but suddenly it cannot merge the newly arriving data. But what kind of data causes the above error, and how should I debug this issue? The data generated as Avro records looks fine and is read on the mapper side without any issue, yet some data causes this failure on the reducer side.
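(A minimal sketch of how the "reads fine" check above can be reproduced outside the job: scan a suspect file record by record with the generic reader and report where the first failure occurs. It assumes the input is a standard Avro container file copied out of HDFS to a local path.)

    import java.io.File;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class AvroScan {
      public static void main(String[] args) throws Exception {
        // args[0]: local path to an Avro container file copied out of HDFS
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
            new File(args[0]), new GenericDatumReader<GenericRecord>());
        long count = 0;
        GenericRecord record = null;
        try {
          while (reader.hasNext()) {
            record = reader.next(record); // reuse the record object between iterations
            count++;
          }
          System.out.println("Read " + count + " records without error");
        } catch (Exception e) {
          System.out.println("Failed after " + count + " records: " + e);
        } finally {
          reader.close();
        }
      }
    }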
Thanks for your help.
Yong

RE: Strange AVRO 1.7.4 exception

Posted by Yong Zhang <ja...@hotmail.com>.
I modified the DirectBinaryDecoder class to add the following lines starting at line 116, to dump more information about it at runtime:
if (((n >>> 1) ^ -(n & 1)) < 0) {
    System.out.println("Got ((n >>> 1) ^ -(n & 1)) = " + ((n >>> 1) ^ -(n & 1)));
    System.out.println("And b = " + b);
    System.out.println("And shift = " + shift);
    System.out.println("And n = " + n);
}
Here is the output:
Got ((n >>> 1) ^ -(n & 1)) = -51
And b = 101
And shift = 0
And n = 101
Not sure what these values mean, but does it help to prove this is a bug?
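One thing the numbers do show: Avro ints are zig-zag encoded, and under the standard zig-zag formula a single raw byte of 101 (0x65, which is ASCII 'e') decodes to exactly -51. A tiny standalone check (same formula, not the actual DirectBinaryDecoder code):

    public class ZigZagCheck {
      public static void main(String[] args) {
        int n = 101;                        // the raw byte observed above (0x65, ASCII 'e')
        int decoded = (n >>> 1) ^ -(n & 1); // standard zig-zag decode of a one-byte varint
        System.out.println(decoded);        // prints -51
      }
    }

If that byte really came from the middle of UTF-8 string data (a hex UUID such as list_id contains plenty of 'e' characters), it would be consistent with the decoder losing alignment with the value boundaries, though that is only a guess.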
Thanks

RE: Strange AVRO 1.7.4 exception

Posted by Yong Zhang <ja...@hotmail.com>.
Based on AVRO-1198, when a negative value is returned in this case, it means the data is malformed, and that JIRA even patched in a message to say so in the exception.
But in my case:
The Avro data is NOT malformed. Why? Because I can query the whole ready-to-merge data set in Hive without any issue, even the records being complained about.
In my case, the real runtime object is "DirectBinaryDecoder"; as you can see in the stack trace, the doReadBytes method indeed points to this class.
So the readInt() method that returns "-51" in this case must be the one from the DirectBinaryDecoder class, listed here:
https://github.com/apache/avro/blob/release-1.7.4/lang/java/avro/src/main/java/org/apache/avro/io/DirectBinaryDecoder.java
starting from line 97, but I really don't know why it returns "-51" in my case.
I tested with Avro 1.7.7 (the latest release of 1.7.x), and the same problem still happens.
Thanks
Yong

RE: Strange AVRO 1.7.4 exception

Posted by Yong Zhang <ja...@hotmail.com>.
After a lot of time spent debugging this, I found some interesting facts, and I want to know if anyone can provide more information related to them.
In our Avro schema, if the "list_id" field, which is a UUID, contains the value "3a3ffb10be8b11e3977ad4ae5284344f", that record will cause this exception.
In this case, check the following Avro class, BinaryDecoder.java, here:
https://github.com/apache/avro/blob/release-1.7.4/lang/java/avro/src/main/java/org/apache/avro/io/BinaryDecoder.java
The link points to release 1.7.4; starting from line 259:
    int length = readInt();
    Utf8 result = (old != null ? old : new Utf8());
    result.setByteLength(length);
    if (0 != length) {
      doReadBytes(result.getBytes(), 0, length);
    }
In this case, when the exception happens, the length is in fact "-51", as computed by the readInt() method, and that is what causes the IndexOutOfBoundsException.
I don't fully understand how the readInt() method works; can anyone explain under what circumstances it would return a negative value?
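(For background, an Avro int is written as a little-endian base-128 varint and then zig-zag decoded, so readInt() returns a negative value exactly when the lowest bit of the assembled varint is 1. A simplified standalone sketch of that scheme, not the actual DirectBinaryDecoder source:)

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class VarIntSketch {
      // Read one zig-zag encoded int, the same scheme Avro uses for "int"
      // (simplified: no guard against overly long varints).
      static int readZigZagInt(InputStream in) throws IOException {
        int n = 0;
        int shift = 0;
        int b;
        do {
          b = in.read();
          if (b < 0) throw new IOException("unexpected end of stream");
          n |= (b & 0x7f) << shift;
          shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1); // odd varint values decode to negative ints
      }

      public static void main(String[] args) throws IOException {
        // 0x65 (101) is a one-byte varint that decodes to -51; 0x66 (102) decodes to +51.
        System.out.println(readZigZagInt(new ByteArrayInputStream(new byte[] { 0x65 })));
        System.out.println(readZigZagInt(new ByteArrayInputStream(new byte[] { 0x66 })));
      }
    }

So a negative "length" is simply a legitimately encoded negative int showing up where a string length was expected, which could also happen if the decoder is misaligned in the byte stream rather than the original files being corrupt.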
Thanks
Yong
