Posted to user@flume.apache.org by Bryce Alcock <u9...@gmail.com> on 2014/09/30 09:54:06 UTC

Sending Multiple Small Avro Messages to Flume in a Flume event

I am not sure I am approaching this problem correctly, but here is the
basic outline:

I would like to send, say, 10000 or even more small Avro messages in a
single Flume event for storage on HDFS.

When I do this, it corrupts the Avro file created on HDFS because (I
assume, based on a bit of reading) it messes with the framing that Avro
provides.

So the long and the short of it is that if I send, say, 2 Flume events,
each containing 10000 Avro messages, and the HDFS sink stores the 2
packets of Avro messages in a single file on HDFS, the first 10000
messages are readable, but message 10001 is corrupt.


I am doing this for performance purposes: I need to send about
1500*3600 = 5,400,000 (yes, 5.4 million) small messages every ~4 seconds.

I know this is a lot of messages...

I can produce the messages at the correct rate, but I cannot get them into
Flume very fast because I have to create a Flume event with an Avro schema
attached to each message, so I thought that batching up a bunch of them at
once would be more efficient.

Thanks in advance!

Q. Boiler
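
For reference, the rates quoted in the question can be worked through in a few lines. This is only a sketch of the arithmetic: the figures 1500*3600, ~4 seconds, and 10000 messages per event come from the post, while the class and method names are made up for illustration.

```java
// Hypothetical helper that restates the numbers from the question.
final class ThroughputEstimate {
    static final long MESSAGES_PER_WINDOW = 1_500L * 3_600L; // 5,400,000 (from the post)
    static final double WINDOW_SECONDS = 4.0;                // "~4 seconds" (from the post)
    static final long BATCH_SIZE = 10_000L;                  // messages per Flume event (from the post)

    // Sustained message rate the pipeline has to absorb.
    static double messagesPerSecond() {
        return MESSAGES_PER_WINDOW / WINDOW_SECONDS;
    }

    // Number of Flume events per window when batching, rounded up.
    static long eventsPerWindow() {
        return (MESSAGES_PER_WINDOW + BATCH_SIZE - 1) / BATCH_SIZE;
    }

    // Flume event rate after batching.
    static double eventsPerSecond() {
        return eventsPerWindow() / WINDOW_SECONDS;
    }

    public static void main(String[] args) {
        System.out.println("messages/s:    " + messagesPerSecond());
        System.out.println("events/window: " + eventsPerWindow());
        System.out.println("events/s:      " + eventsPerSecond());
    }
}
```

With these inputs, batching turns 1,350,000 messages per second into 135 Flume events per second, which is why the per-event overhead matters so much here.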

Re: Sending Multiple Small Avro Messages to Flume in a Flume event

Posted by Hari Shreedharan <hs...@cloudera.com>.
You could bunch these messages up into a single Flume event and then write a serializer that reads each of the Avro messages in the event and writes them into an Avro container file (you can take a look at the AvroEventSerializer). The downside is that you’d have to decode and re-encode the messages in your serializer.


Thanks,
Hari
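
One way the batching half of the suggestion above might look is sketched below. It assumes a simple length-prefixed layout for the event body, which is an illustrative choice and not something the thread or Flume prescribes; BatchFraming, pack, and unpack are hypothetical names. The sender would call pack to build the body of one Flume event, and a custom serializer could call unpack on the event body and append each record to a single Avro container file, so the file carries only one header.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: frames many small records into one event body.
// Assumed layout: [4-byte big-endian length][payload], repeated.
final class BatchFraming {

    // Concatenates the records, each preceded by its length.
    static byte[] pack(List<byte[]> records) {
        int total = 0;
        for (byte[] r : records) {
            total += Integer.BYTES + r.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        for (byte[] r : records) {
            buf.putInt(r.length);
            buf.put(r);
        }
        return buf.array();
    }

    // Splits a body produced by pack() back into the individual records.
    static List<byte[]> unpack(byte[] body) {
        ByteBuffer buf = ByteBuffer.wrap(body);
        List<byte[]> records = new ArrayList<>();
        while (buf.remaining() > 0) {
            if (buf.remaining() < Integer.BYTES) {
                throw new IllegalArgumentException("truncated length prefix");
            }
            int len = buf.getInt();
            if (len < 0 || len > buf.remaining()) {
                throw new IllegalArgumentException("bad record length: " + len);
            }
            byte[] r = new byte[len];
            buf.get(r);
            records.add(r);
        }
        return records;
    }

    public static void main(String[] args) {
        // Stand-in payloads; real ones would be binary-encoded Avro records.
        List<byte[]> in = Arrays.asList(new byte[] {1, 2, 3}, new byte[] {9});
        byte[] body = pack(in);
        System.out.println("records recovered: " + unpack(body).size());
    }
}
```

The Avro and Flume sides are deliberately left out, since they need those libraries on the classpath. The point of the framing is only that the serializer, not the sender, decides where the container file's header and block boundaries go.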
