Posted to user@avro.apache.org by Vishwas Siravara <vs...@gmail.com> on 2019/07/25 19:52:49 UTC

Deserialize only a subset of the avro message

Hey all,
I am using Avro to deserialize a message with 8000 fields from Kafka. We
use the binary encoding, so the schema is not sent with the messages, and
we don't use a schema registry. My problem is that I only want 20 fields
from this message. I tried making my deserialization schema a subset of
the producer schema, but it does not work. I get this error:
"org.apache.avro.AvroRuntimeException: Malformed data. Length is negative:
-27". How can I fix this? Should I read the entire message and then filter
out what I need? Thank you so much for your help.

For instance, if my producer schema looks like this:

"fields" : [ {
  "name" : "CMLS_MRCH_CTRY_CD_NUM_DRVD",
  "type" : "int",
  "doc" : "decimal(3,0)",
  "default" : 0
},{
  "name" : "CMLS_ISSR_BIN_DRVD",
  "type" : "int",
  "doc" : "decimal(6,0)",
  "default" : 0
}, {
  "name" : "CMLS_DGTL_CMRC_PGM_IND",
  "type" : "int",
  "doc" : "decimal(1,0)",
  "default" : 0
}]

If I only want the first field from the message, my code breaks when I
change the deserialization schema to:

"fields" : [ {
  "name" : "CMLS_MRCH_CTRY_CD_NUM_DRVD",
  "type" : "int",
  "doc" : "decimal(3,0)",
  "default" : 0
}]

Re: Deserialize only a subset of the avro message

Posted by David Carlton <ca...@sumologic.com>.

It's okay for the reader to use a schema that's a subset of the writer
schema, but I think when you do that you need to invoke a deserialization
code path that takes both the reader and writer schemas. Otherwise the
deserializer will assume that the reader and writer schemas are the same.
(E.g. if you're using the Java GenericDatumReader, you'll see different
constructors depending on whether the schemas are the same or different.)
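
As a rough illustration of the two-schema code path, here is a minimal
sketch using the Java GenericDatumReader constructor that takes the writer
schema first and the reader schema second. The record name "Txn", the class
name SubsetRead, and the sample values are placeholders (the thread only
shows the "fields" arrays, and the "doc" attributes are left out for
brevity), and it assumes the consumer has a copy of the full producer
schema available, since no schema registry is in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SubsetRead {
    // Full producer (writer) schema. "Txn" is a placeholder record name.
    static final String WRITER_JSON =
        "{\"type\":\"record\",\"name\":\"Txn\",\"fields\":["
        + "{\"name\":\"CMLS_MRCH_CTRY_CD_NUM_DRVD\",\"type\":\"int\",\"default\":0},"
        + "{\"name\":\"CMLS_ISSR_BIN_DRVD\",\"type\":\"int\",\"default\":0},"
        + "{\"name\":\"CMLS_DGTL_CMRC_PGM_IND\",\"type\":\"int\",\"default\":0}]}";

    // Consumer (reader) schema: same record name, only the wanted field.
    static final String READER_JSON =
        "{\"type\":\"record\",\"name\":\"Txn\",\"fields\":["
        + "{\"name\":\"CMLS_MRCH_CTRY_CD_NUM_DRVD\",\"type\":\"int\",\"default\":0}]}";

    // Separate parsers, because one parser refuses to define "Txn" twice.
    static final Schema WRITER = new Schema.Parser().parse(WRITER_JSON);
    static final Schema READER = new Schema.Parser().parse(READER_JSON);

    // Stand-in for the producer: binary-encode one record with the full schema.
    static byte[] encode(int ctry, int bin, int ind) {
        GenericRecord rec = new GenericData.Record(WRITER);
        rec.put("CMLS_MRCH_CTRY_CD_NUM_DRVD", ctry);
        rec.put("CMLS_ISSR_BIN_DRVD", bin);
        rec.put("CMLS_DGTL_CMRC_PGM_IND", ind);
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(WRITER).write(rec, enc);
            enc.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Consumer side: pass BOTH schemas so Avro can skip the unwanted fields.
    static GenericRecord decodeSubset(byte[] payload) {
        try {
            BinaryDecoder dec = DecoderFactory.get().binaryDecoder(payload, null);
            GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(WRITER, READER);
            return reader.read(null, dec);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience accessor for the one field kept by the reader schema.
    static int readCountryCode(byte[] payload) {
        return (Integer) decodeSubset(payload).get("CMLS_MRCH_CTRY_CD_NUM_DRVD");
    }

    public static void main(String[] args) {
        byte[] payload = encode(840, 412345, 1);
        System.out.println(readCountryCode(payload)); // prints 840
    }
}
```

The point of the sketch is only the constructor call: with a single schema,
the reader decodes the bytes as if they had been written with that schema,
which is presumably how a length gets read from the wrong bytes in the
8000-field case.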

-- 
David Carlton
carlton@sumologic.com