Posted to user@avro.apache.org by Pedro Cardoso <pe...@feedzai.com> on 2020/01/29 15:21:55 UTC

How to serialize & deserialize contiguous block of GenericRecords

Hello,

I am trying to write a sequence of Avro GenericRecords into a Java
ByteBuffer and deserialize them later on. I have tried using
DataFileWriter/Readers and copying the content of the underlying buffer to
my target object. The alternative is to split the ByteBuffer into the
individually serialized GenericRecords and use a BinaryDecoder to read
each record's properties one by one.

Please find attached an example of the former approach.
That code fails with:

org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:223)
at com.feedzai.research.experiments.bookkeeper.Avro.main(Avro.java:97)
Caused by: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:318)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:212)
... 1 more

Hence my questions are:
 - Is it at all possible to serialize/deserialize lists of Avro records to
a ByteBuffer and back?
 - If so, can anyone point me in the right direction?
 - If not, can anyone point me to code examples of alternative solutions?

Thank you and have a good day.

Pedro Cardoso

Research Data Engineer

pedro.cardoso@feedzai.com





Re: How to serialize & deserialize contiguous block of GenericRecords

Posted by Ryan Skraba <ry...@skraba.com>.
Ah!  OK, I think I understand better.

Your serialize method looks almost OK -- as I mentioned, you can use an
OutputStream wrapper to write directly to a ByteBuffer.  This wrapper
doesn't exist in the Java utilities AFAIK, but there are examples on the
web (
https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/io/ByteBufferOutputStream.java).
The one I mentioned in the previous message wraps a list of ByteBuffers.
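To sketch the idea (this class is hypothetical, not from any library, and
a real version would need to handle a full buffer more gracefully):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// A bare-bones OutputStream that writes into a single pre-allocated ByteBuffer.
public class ByteBufferBackedOutputStream extends OutputStream {
  private final ByteBuffer buffer;

  public ByteBufferBackedOutputStream(ByteBuffer buffer) {
    this.buffer = buffer;
  }

  @Override
  public void write(int b) throws IOException {
    if (!buffer.hasRemaining()) {
      throw new IOException("ByteBuffer is full");
    }
    buffer.put((byte) b);
  }

  @Override
  public void write(byte[] bytes, int off, int len) throws IOException {
    if (buffer.remaining() < len) {
      throw new IOException("ByteBuffer is full");
    }
    buffer.put(bytes, off, len);
  }
}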

In any case, don't forget to *encoder.flush()* before closing the
outputStream in your serialize!
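Concretely, using the same names as your serialize method below, the end
of the method would become something like:

for (final Event event : events) {
    final GenericData.Record record = new GenericData.Record(schema);
    // populate record object
    datumWriter.write(record, encoder);
}
encoder.flush();   // without this, bytes buffered in the encoder are lost
outputStream.close();
buffer.put(outputStream.toByteArray());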

Your deserialize is a bit problematic, because the *entire* byte buffer
capacity will be passed if you use buffer.array(), not just the bytes that
were used.

Fortunately, you can use the ByteBufferInputStream already present in Avro
to handle this.  The code would look something like:

ByteBufferInputStream bbais =
    new ByteBufferInputStream(Collections.singletonList(buffer));
final BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bbais, null);
final GenericDatumReader<GenericRecord> datumReader =
    new GenericDatumReader<>(schema);

List<Event> out = new ArrayList<>();
while (!decoder.isEnd()) {
  GenericRecord record = datumReader.read(null, decoder);
  // Your transformation of record to event, and add to the list here...
}
return out;

It's critical that buffer has the position and limit set correctly to the
start and end of the binary data before entering this method, of course!
The position and limit will not be correct coming out of the serialize
method, although probably a buffer.flip() will do what you want.
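So the whole round trip would look roughly like this (the driver code is
hypothetical, just to show where the flip goes):

ByteBuffer buffer = ByteBuffer.allocate(1024 * 1024);
serialize(events, schemaId, buffer);  // put() leaves position at the end of the data
buffer.flip();                        // position -> 0, limit -> end of the data
List<Event> roundTripped = deserialize(buffer, schemaId);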

I hope this is useful, all my best, Ryan



Re: How to serialize & deserialize contiguous block of GenericRecords

Posted by Pedro Cardoso <pe...@feedzai.com>.
Hi Ryan,

Thank you so much for your reply! You were right about the encoder in the
serializer method; that was my mistake. I submitted a PNG rather than plain
text because I thought the syntax highlighting would help.
I may not have been very clear about my question. I understand that with a
DatumWriter/DatumReader I can serialize and deserialize a single Avro
GenericRecord.

My question is, consider several GenericRecords all concatenated into a
single byte array as follows:

*[serializedGenericRecord1, serializedGenericRecord2,
serializedGenericRecord3, etc...]*

How can I deserialize them using the DatumReader API? If it's possible
out of the box, can you point me in the right direction?
Does this make sense?

See the code below (in text this time :) ) if it helps:

public void serialize(final List<Event> events, final UUID schemaId,
                      final ByteBuffer buffer) throws IOException {
    final Schema schema = getAvroSchema(schemaId);
    final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    final Encoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);
    final GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);

    for (final Event event : events) {
        final GenericData.Record record = new GenericData.Record(schema);
        // populate record object
        datumWriter.write(record, encoder);
    }

    outputStream.close();
    buffer.put(outputStream.toByteArray());
}

public List<Event> deserialize(final ByteBuffer buffer, final UUID schemaId)
        throws IOException {
    final List<Event> events = new ArrayList<>();
    final Schema schema = getAvroSchema(schemaId);
    final BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(buffer.array(), null);
    final GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
    GenericRecord record = new GenericData.Record(schema);

    // How do I loop?
    record = datumReader.read(record, decoder);
    // populate Event object and add to list

    return events;
}


Thank you once again for your help!

Cheers
Pedro Cardoso

Research Data Engineer

pedro.cardoso@feedzai.com







Re: How to serialize & deserialize contiguous block of GenericRecords

Posted by Ryan Skraba <ry...@skraba.com>.
Hello!

It's a bit difficult to discover what's going wrong -- I'm not sure that
the code in the image corresponds to the exception you are encountering!
Notably, there's no reference to DataFileStream...  Typically, it would be
easier with code as TXT than as PNG!

It is definitely possible to serialize Avro GenericRecords into bytes!  The
example code looks like it's using the DataFileWriter (and ignoring the
Encoder).  Keep in mind that this creates an Avro file (also known as an
Avro Object Container file or .avro file).  This is more than just "pure"
serialized bytes -- it contains some header information and sync markers,
which makes it easier to split and process a single file on multiple nodes
in big data systems.
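For what it's worth, the container route works fine as long as the writer
and reader are paired -- something like this sketch (the schema and records
are assumed to come from your own code):

ByteArrayOutputStream out = new ByteArrayOutputStream();
DataFileWriter<GenericRecord> fileWriter =
    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
fileWriter.create(schema, out);   // writes the header
for (GenericRecord record : records) {
  fileWriter.append(record);
}
fileWriter.close();

DataFileStream<GenericRecord> fileReader = new DataFileStream<>(
    new ByteArrayInputStream(out.toByteArray()),
    new GenericDatumReader<GenericRecord>());  // understands header and sync markers
while (fileReader.hasNext()) {
  GenericRecord record = fileReader.next();
  // use record...
}
fileReader.close();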

If you were to use a DatumWriter and an encoder, you could obtain just the
"pure" binary data without any framing bytes.  If that is your goal, I
suggest looking into the DatumWriter / DatumReader classes (as opposed to
the DataFileXxx classes).
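A rough sketch of that route (again, schema and records assumed to come
from your own code):

ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
for (GenericRecord record : records) {
  datumWriter.write(record, encoder);  // records laid end to end, no framing
}
encoder.flush();
byte[] pureBytes = out.toByteArray();  // no header, no sync markers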

From the given exception "Invalid sync" it looks like you might be writing
pure Avro bytes and attempting to read the file format.

Since the DatumWriter API uses OutputStream (instead of ByteBuffer),
there's a utility class called ByteBufferOutputStream that you might find
interesting.  It permits writing to a series of 8K java.nio.ByteBuffer
instances, which might be OK for your use case.  There are other
implementations of ByteBuffer-backed OutputStreams available that might be
better suited.
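If I remember the API correctly, writing through it looks something like:

ByteBufferOutputStream out = new ByteBufferOutputStream();  // org.apache.avro.util
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
for (GenericRecord record : records) {
  datumWriter.write(record, encoder);
}
encoder.flush();
List<ByteBuffer> buffers = out.getBufferList();  // the filled 8K buffers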

I hope this is useful, Ryan

