Posted to users@kafka.apache.org by Denny Lee <de...@gmail.com> on 2014/06/24 17:35:24 UTC

Experiences with larger message sizes

By any chance, has anyone worked with Kafka using message sizes of approximately 50MB?  Based on some of the previous threads, there are probably some concerns about memory pressure due to compression on the broker and decompression on the consumer, and about best practices for setting batch size (to ultimately not have the compressed message exceed the message size limit).

Any other best practices or thoughts concerning this scenario?

Thanks!
Denny
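
For context on that parenthetical: with the 0.8-era producer, a compressed batch is written to the log as a single wrapper message, so it is the compressed batch as a whole that has to stay under the broker's message.max.bytes. Below is a minimal, hypothetical sketch of a producer configured along those lines; the broker address, topic name, and payload size are illustrative assumptions only, not settings from this thread.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class LargePayloadProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");               // hypothetical broker
        props.put("serializer.class", "kafka.serializer.DefaultEncoder"); // payload as raw bytes
        props.put("compression.codec", "gzip");      // compressed by the producer, decompressed by consumers
        props.put("producer.type", "async");
        props.put("batch.num.messages", "1");        // keep each compressed batch to a single large payload
        Producer<String, byte[]> producer = new Producer<String, byte[]>(new ProducerConfig(props));

        byte[] payload = new byte[50 * 1024 * 1024];  // stand-in for a ~50MB message
        producer.send(new KeyedMessage<String, byte[]>("large-payloads", payload));
        producer.close();
    }
}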


Re: Experiences with larger message sizes

Posted by Denny Lee <de...@gmail.com>.
Thanks for the info Joe - yes, I do think this will be very useful. Will look out for this, eh?!

On June 24, 2014 at 10:32:08 AM, Joe Stein (joe.stein@stealth.ly) wrote:

You could then chunk the data (wrapped in an outer message so you have meta data like file name, total size, current chunk size) and produce that with the partition key being filename.

We are in progress working on a system for doing file loading to Kafka (which will eventually support both chunked and pointers [initially chunking line by line since use case 1 is to read from a closed file handle location]) https://github.com/stealthly/f2k (there is not much there yet maybe in the next few days / later this week) maybe useful for your use case or we could eventually add your use case to it.

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop
********************************************/


On Tue, Jun 24, 2014 at 12:37 PM, Denny Lee <de...@gmail.com> wrote:
Hey Joe,

Yes, I have - my original plan is to do something similar to what you suggested which was to simply push the data into HDFS / S3 and then having only the event information within Kafka so that way multiple consumers can just read the event information and ping HDFS/S3 for the actual message itself.  

Part of the reason for considering just pushing the entire message up is due to the potential where we will have a firehose of messages of this size and we will need to push this data to multiple locations.

Thanks,
Denny

On June 24, 2014 at 9:26:49 AM, Joe Stein (joe.stein@stealth.ly) wrote:

Hi Denny, have you considered saving those files to HDFS and sending the
"event" information to Kafka?

You could then pass that off to Apache Spark in a consumer and get data
locality for the file saved (or something of the sort [no pun intended]).

You could also stream every line (or however you want to "chunk" it) in the
file as a separate message to the broker with a wrapping message object (so
you know which file you are dealing with when consuming).

What you plan to-do with the data has a lot to-do with how you are going to
process and manage it.

/*******************************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/


On Tue, Jun 24, 2014 at 11:35 AM, Denny Lee <de...@gmail.com> wrote:

> By any chance has anyone worked with using Kafka with message sizes that
> are approximately 50MB in size? Based on from some of the previous threads
> there are probably some concerns on memory pressure due to the compression
> on the broker and decompression on the consumer and a best practices on
> ensuring batch size (to ultimately not have the compressed message exceed
> message size limit).
>
> Any other best practices or thoughts concerning this scenario?
>
> Thanks!
> Denny
>
>


Re: Experiences with larger message sizes

Posted by Joe Stein <jo...@stealth.ly>.
You could then chunk the data (wrapped in an outer message so you have
metadata like file name, total size, and current chunk size) and produce it
with the partition key being the filename.

We are working on a system for doing file loading to Kafka (which will
eventually support both chunked and pointer approaches [initially chunking
line by line, since use case 1 is to read from a closed file handle
location]): https://github.com/stealthly/f2k. There is not much there yet
(maybe in the next few days / later this week), but it may be useful for
your use case, or we could eventually add your use case to it.
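
A rough sketch of that chunking pattern with the 0.8-era Java producer API is below; this is not f2k itself, and the broker address, topic name, envelope layout, and chunk size are assumptions made for illustration.

import java.io.File;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class FileChunkProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");                   // hypothetical broker
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");    // chunk payload as raw bytes
        props.put("key.serializer.class", "kafka.serializer.StringEncoder"); // file name as the key
        Producer<String, byte[]> producer = new Producer<String, byte[]>(new ProducerConfig(props));

        String fileName = args[0];
        long totalSize = new File(fileName).length();
        int chunkSize = 512 * 1024;                                          // well below message.max.bytes
        byte[] nameBytes = fileName.getBytes(StandardCharsets.UTF_8);

        FileInputStream in = new FileInputStream(fileName);
        try {
            byte[] buf = new byte[chunkSize];
            int read;
            int chunkIndex = 0;
            while ((read = in.read(buf)) > 0) {
                // Outer envelope: file name, total size, chunk index, chunk length, then the chunk bytes.
                ByteBuffer envelope = ByteBuffer.allocate(4 + nameBytes.length + 8 + 4 + 4 + read);
                envelope.putInt(nameBytes.length).put(nameBytes)
                        .putLong(totalSize).putInt(chunkIndex++).putInt(read)
                        .put(buf, 0, read);
                // Keying on the file name routes every chunk of a file to the same partition, in order.
                producer.send(new KeyedMessage<String, byte[]>("file-chunks", fileName, envelope.array()));
            }
        } finally {
            in.close();
            producer.close();
        }
    }
}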

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/


On Tue, Jun 24, 2014 at 12:37 PM, Denny Lee <de...@gmail.com> wrote:

> Hey Joe,
>
> Yes, I have - my original plan is to do something similar to what you
> suggested which was to simply push the data into HDFS / S3 and then having
> only the event information within Kafka so that way multiple consumers can
> just read the event information and ping HDFS/S3 for the actual message
> itself.
>
> Part of the reason for considering just pushing the entire message up is
> due to the potential where we will have a firehose of messages of this size
> and we will need to push this data to multiple locations.
>
> Thanks,
> Denny
>
> On June 24, 2014 at 9:26:49 AM, Joe Stein (joe.stein@stealth.ly) wrote:
>
> Hi Denny, have you considered saving those files to HDFS and sending the
> "event" information to Kafka?
>
> You could then pass that off to Apache Spark in a consumer and get data
> locality for the file saved (or something of the sort [no pun intended]).
>
> You could also stream every line (or however you want to "chunk" it) in
> the
> file as a separate message to the broker with a wrapping message object
> (so
> you know which file you are dealing with when consuming).
>
> What you plan to-do with the data has a lot to-do with how you are going
> to
> process and manage it.
>
> /*******************************************
> Joe Stein
> Founder, Principal Consultant
> Big Data Open Source Security LLC
> http://www.stealth.ly
> Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> ********************************************/
>
>
> On Tue, Jun 24, 2014 at 11:35 AM, Denny Lee <de...@gmail.com>
> wrote:
>
> > By any chance has anyone worked with using Kafka with message sizes that
> > are approximately 50MB in size? Based on from some of the previous
> threads
> > there are probably some concerns on memory pressure due to the
> compression
> > on the broker and decompression on the consumer and a best practices on
> > ensuring batch size (to ultimately not have the compressed message
> exceed
> > message size limit).
> >
> > Any other best practices or thoughts concerning this scenario?
> >
> > Thanks!
> > Denny
> >
> >
>
>

Re: Experiences with larger message sizes

Posted by Denny Lee <de...@gmail.com>.
Hey Joe,

Yes, I have - my original plan was to do something similar to what you suggested, which was to simply push the data into HDFS / S3 and keep only the event information within Kafka, so that multiple consumers can just read the event information and ping HDFS/S3 for the actual message itself.

Part of the reason for considering pushing the entire message instead is that we may have a firehose of messages of this size and will need to push this data to multiple locations.

Thanks,
Denny

On June 24, 2014 at 9:26:49 AM, Joe Stein (joe.stein@stealth.ly) wrote:

Hi Denny, have you considered saving those files to HDFS and sending the  
"event" information to Kafka?  

You could then pass that off to Apache Spark in a consumer and get data  
locality for the file saved (or something of the sort [no pun intended]).  

You could also stream every line (or however you want to "chunk" it) in the  
file as a separate message to the broker with a wrapping message object (so  
you know which file you are dealing with when consuming).  

What you plan to-do with the data has a lot to-do with how you are going to  
process and manage it.  

/*******************************************  
Joe Stein  
Founder, Principal Consultant  
Big Data Open Source Security LLC  
http://www.stealth.ly  
Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>  
********************************************/  


On Tue, Jun 24, 2014 at 11:35 AM, Denny Lee <de...@gmail.com> wrote:  

> By any chance has anyone worked with using Kafka with message sizes that  
> are approximately 50MB in size? Based on from some of the previous threads  
> there are probably some concerns on memory pressure due to the compression  
> on the broker and decompression on the consumer and a best practices on  
> ensuring batch size (to ultimately not have the compressed message exceed  
> message size limit).  
>  
> Any other best practices or thoughts concerning this scenario?  
>  
> Thanks!  
> Denny  
>  
>  

Re: Experiences with larger message sizes

Posted by Joe Stein <jo...@stealth.ly>.
Hi Denny, have you considered saving those files to HDFS and sending the
"event" information to Kafka?

You could then pass that off to Apache Spark in a consumer and get data
locality for the file saved (or something of the sort [no pun intended]).

You could also stream every line (or however you want to "chunk" it) in the
file as a separate message to the broker with a wrapping message object (so
you know which file you are dealing with when consuming).

What you plan to do with the data has a lot to do with how you are going to
process and manage it.
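
As a minimal sketch of that pointer pattern, the Kafka side stays tiny even when the file itself is ~50MB; the topic name, key, JSON fields, HDFS path, and timestamp below are invented for illustration.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class FileEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");                // hypothetical broker
        props.put("serializer.class", "kafka.serializer.StringEncoder");  // small JSON event as a string
        Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));

        // The large payload already lives in HDFS (or S3); Kafka carries only the reference,
        // so any number of consumers (Spark jobs, loaders, archivers) can fetch it independently.
        String event = "{\"path\":\"hdfs://namenode/incoming/file-0001.xml\","
                + "\"sizeBytes\":52428800,"
                + "\"receivedAt\":\"2014-06-24T17:35:00Z\"}";
        producer.send(new KeyedMessage<String, String>("file-events", "file-0001.xml", event));
        producer.close();
    }
}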

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/


On Tue, Jun 24, 2014 at 11:35 AM, Denny Lee <de...@gmail.com> wrote:

> By any chance has anyone worked with using Kafka with message sizes that
> are approximately 50MB in size?  Based on from some of the previous threads
> there are probably some concerns on memory pressure due to the compression
> on the broker and decompression on the consumer and a best practices on
> ensuring batch size (to ultimately not have the compressed message exceed
> message size limit).
>
> Any other best practices or thoughts concerning this scenario?
>
> Thanks!
> Denny
>
>

Re: Experiences with larger message sizes

Posted by Denny Lee <de...@gmail.com>.
Yes, thanks very much Luke - this is very helpful for my plans.  I was under the same impression, but it’s always good to have verification, eh?!


On June 26, 2014 at 4:48:03 PM, Bert Corderman (bertcord@gmail.com) wrote:

Thanks for the details Luke.  

At what point would you consider a message too big?  

Are you using compression?  

Bert  

On Thursday, June 26, 2014, Luke Forehand <  
luke.forehand@networkedinsights.com> wrote:  

> I have used 50MB message size and it is not a great idea. First of all  
> you need to make sure you have these settings in sync:  
> message.max.bytes  
> replica.fetch.max.bytes  
> fetch.message.max.bytes  
>  
> I had not set the replica fetch setting and didn't realize one of my  
> partitions was not replicating after a large message was produced. I also  
> ran into heap issues with having to fetch such a large message, lots of  
> unnecessary garbage collection. I suggest breaking down your message into  
> smaller chunks. In my case, I decided to break an XML input stream (which  
> had a root element wrapping a ridiculously large number of children) into  
> smaller messages, having to parse the large xml root document and re-wrap  
> each child element with a shallow clone of its parents as I iterated the  
> stream.  
>  
> -Luke  
>  
> ________________________________________  
> From: Denny Lee <denny.g.lee@gmail.com>  
> Sent: Tuesday, June 24, 2014 10:35 AM  
> To: users@kafka.apache.org  
> Subject: Experiences with larger message sizes  
>  
> By any chance has anyone worked with using Kafka with message sizes that  
> are approximately 50MB in size? Based on from some of the previous threads  
> there are probably some concerns on memory pressure due to the compression  
> on the broker and decompression on the consumer and a best practices on  
> ensuring batch size (to ultimately not have the compressed message exceed  
> message size limit).  
>  
> Any other best practices or thoughts concerning this scenario?  
>  
> Thanks!  
> Denny  
>  

Re: Experiences with larger message sizes

Posted by Luke Forehand <lu...@networkedinsights.com>.
I am using gzip compression.  Too big is really difficult to define because
it always depends (for example, on what your hardware can handle), but I
would say no more than a few megabytes.  Having said that, we are still
successfully using the 50MB size in production for some things, but it comes
at a cost.  It requires us to tune each consumer individually and keep these
consumers separated (not within the same JVM) for SLA reasons.
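
A minimal sketch of that kind of isolated, individually tuned consumer follows, using the 0.8-era high-level consumer API; the ZooKeeper address, group id, topic, and sizes are assumptions for illustration, and the broker side also needs message.max.bytes and replica.fetch.max.bytes at least as large, per the settings Luke lists further down the thread.

import java.util.Collections;
import java.util.List;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class LargeMessageConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");           // hypothetical ZooKeeper ensemble
        props.put("group.id", "large-xml-consumer");          // group dedicated to the big-message topic
        // Must be at least as large as the biggest message; this drives the consumer's heap needs,
        // which is why these consumers are kept out of the JVMs serving normal-sized topics.
        props.put("fetch.message.max.bytes", String.valueOf(55 * 1024 * 1024));
        props.put("auto.offset.reset", "smallest");

        ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        List<KafkaStream<byte[], byte[]>> streams =
                connector.createMessageStreams(Collections.singletonMap("large-xml", 1)).get("large-xml");

        // Iterate the single stream and hand each large payload off for processing.
        for (MessageAndMetadata<byte[], byte[]> m : streams.get(0)) {
            System.out.println("partition " + m.partition() + ", " + m.message().length + " bytes");
        }
    }
}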

-Luke




On 6/26/14, 6:47 PM, "Bert Corderman" <be...@gmail.com> wrote:

>Thanks for the details Luke.
>
>At what point would you consider a message too big?
>
>Are you using compression?
>
>Bert
>
>On Thursday, June 26, 2014, Luke Forehand <
>luke.forehand@networkedinsights.com> wrote:
>
>> I have used 50MB message size and it is not a great idea.  First of all
>> you need to make sure you have these settings in sync:
>> message.max.bytes
>> replica.fetch.max.bytes
>> fetch.message.max.bytes
>>
>> I had not set the replica fetch setting and didn't realize one of my
>> partitions was not replicating after a large message was produced.  I
>>also
>> ran into heap issues with having to fetch such a large message, lots of
>> unnecessary garbage collection.  I suggest breaking down your message
>>into
>> smaller chunks.  In my case, I decided to break an XML input stream
>>(which
>> had a root element wrapping a ridiculously large number of children)
>>into
>> smaller messages, having to parse the large xml root document and
>>re-wrap
>> each child element with a shallow clone of its parents as I iterated the
>> stream.
>>
>> -Luke
>>
>> ________________________________________
>> From: Denny Lee <denny.g.lee@gmail.com>
>> Sent: Tuesday, June 24, 2014 10:35 AM
>> To: users@kafka.apache.org
>> Subject: Experiences with larger message sizes
>>
>> By any chance has anyone worked with using Kafka with message sizes that
>> are approximately 50MB in size?  Based on from some of the previous
>>threads
>> there are probably some concerns on memory pressure due to the
>>compression
>> on the broker and decompression on the consumer and a best practices on
>> ensuring batch size (to ultimately not have the compressed message
>>exceed
>> message size limit).
>>
>> Any other best practices or thoughts concerning this scenario?
>>
>> Thanks!
>> Denny
>>


Re: Experiences with larger message sizes

Posted by Bert Corderman <be...@gmail.com>.
Thanks for the details Luke.

At what point would you consider a message too big?

Are you using compression?

Bert

On Thursday, June 26, 2014, Luke Forehand <
luke.forehand@networkedinsights.com> wrote:

> I have used 50MB message size and it is not a great idea.  First of all
> you need to make sure you have these settings in sync:
> message.max.bytes
> replica.fetch.max.bytes
> fetch.message.max.bytes
>
> I had not set the replica fetch setting and didn't realize one of my
> partitions was not replicating after a large message was produced.  I also
> ran into heap issues with having to fetch such a large message, lots of
> unnecessary garbage collection.  I suggest breaking down your message into
> smaller chunks.  In my case, I decided to break an XML input stream (which
> had a root element wrapping a ridiculously large number of children) into
> smaller messages, having to parse the large xml root document and re-wrap
> each child element with a shallow clone of its parents as I iterated the
> stream.
>
> -Luke
>
> ________________________________________
> From: Denny Lee <denny.g.lee@gmail.com>
> Sent: Tuesday, June 24, 2014 10:35 AM
> To: users@kafka.apache.org
> Subject: Experiences with larger message sizes
>
> By any chance has anyone worked with using Kafka with message sizes that
> are approximately 50MB in size?  Based on from some of the previous threads
> there are probably some concerns on memory pressure due to the compression
> on the broker and decompression on the consumer and a best practices on
> ensuring batch size (to ultimately not have the compressed message exceed
> message size limit).
>
> Any other best practices or thoughts concerning this scenario?
>
> Thanks!
> Denny
>

RE: Experiences with larger message sizes

Posted by Luke Forehand <lu...@networkedinsights.com>.
I have used 50MB message size and it is not a great idea.  First of all you need to make sure you have these settings in sync:
message.max.bytes
replica.fetch.max.bytes
fetch.message.max.bytes

I had not set the replica fetch setting and didn't realize one of my partitions was not replicating after a large message was produced.  I also ran into heap issues with having to fetch such a large message, lots of unnecessary garbage collection.  I suggest breaking down your message into smaller chunks.  In my case, I decided to break an XML input stream (which had a root element wrapping a ridiculously large number of children) into smaller messages, having to parse the large XML root document and re-wrap each child element with a shallow clone of its parents as I iterated the stream.
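
A rough StAX-based sketch of that kind of splitting is below. This is not Luke's actual code: it assumes the records sit one level under the root and re-wraps each one with a shallow clone of just that root, and each small document would then be produced to Kafka (for example keyed by source file, as in the chunking sketch earlier in the thread) instead of the original 50MB blob.

import java.io.FileInputStream;
import java.io.StringWriter;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class XmlSplitter {
    public static void main(String[] args) throws Exception {
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();
        XMLEventReader reader = inFactory.createXMLEventReader(new FileInputStream(args[0]));

        StartElement root = null;      // the huge wrapping root element
        StringWriter current = null;   // buffer for the small document being built
        XMLEventWriter writer = null;
        int depth = 0;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
                if (depth == 1) {      // remember the root element so it can wrap each record
                    root = event.asStartElement();
                    continue;
                }
                if (depth == 2) {      // a new child record starts: open a fresh small document
                    current = new StringWriter();
                    writer = outFactory.createXMLEventWriter(current);
                    writer.add(root);  // shallow clone of the parent as the wrapper
                }
            }
            if (depth >= 2 && writer != null) {
                writer.add(event);     // copy the child record's events verbatim
            }
            if (event.isEndElement()) {
                if (depth == 2) {      // child record complete: close the wrapper and emit
                    writer.add(eventFactory.createEndElement(root.getName(), null));
                    writer.close();
                    String smallDoc = current.toString();
                    // produce smallDoc to Kafka here as one small message
                    System.out.println("record of " + smallDoc.length() + " chars");
                }
                depth--;
            }
        }
        reader.close();
    }
}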

-Luke

________________________________________
From: Denny Lee <de...@gmail.com>
Sent: Tuesday, June 24, 2014 10:35 AM
To: users@kafka.apache.org
Subject: Experiences with larger message sizes

By any chance has anyone worked with using Kafka with message sizes that are approximately 50MB in size?  Based on from some of the previous threads there are probably some concerns on memory pressure due to the compression on the broker and decompression on the consumer and a best practices on ensuring batch size (to ultimately not have the compressed message exceed message size limit).

Any other best practices or thoughts concerning this scenario?

Thanks!
Denny