Posted to users@kafka.apache.org by Nilesh Chhapru <ni...@ugamsolutions.com> on 2015/07/27 11:33:48 UTC

Cache Memory Kafka Process

Hi All,

I am facing issues with the Kafka broker process taking a lot of cache
memory. I just wanted to know if the process really needs that much
cache memory, or whether I can clear the OS-level cache with a cron job.

Regards,
Nilesh Chhapru.

Re: Cache Memory Kafka Process

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
On Tue, Jul 28, 2015 at 11:29 PM, Nilesh Chhapru <
nilesh.chhapru@ugamsolutions.com> wrote:

> Hi Ewen,
>
> Thanks for the reply.
> The assumptions that you made for replication and partitions are
> correct: 120 is the total number of partitions, and the replication factor
> is 1 for all the topics.
>
> Does that mean that a broker will keep all the messages that are
> produced in memory, or only the unconsumed messages?
>

The operating system is caching the data, not Kafka. So there's no policy
in Kafka that controls caching at that level. If you have consumers that
repeatedly consume old data, the operating system will cache those sections
of the files. If consumers are normally at the end of the logs, the
operating system will cache those parts of the log files. (In fact, this
doesn't even happen at the granularity of messages; the cache operates at
the granularity of pages: https://en.wikipedia.org/wiki/Page_cache)

But this is a good thing. If something else needed that memory, the OS
would just get rid of the cached data, opting to read the data back from
disk if it was needed again in the future. It's very unlikely that clearing
any of this data would improve the performance of your workload. If you're
seeing degradation of performance due to memory usage, it probably means
you're simply trying to access more data than fits in memory and end up
being limited by disk throughput as data needs to be reloaded.
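
If it helps to see this concretely (a generic Linux sketch, nothing
Kafka-specific), you can read /proc/meminfo and treat the page cache as
memory that is effectively still available. MemAvailable only exists on
newer kernels, so the snippet below falls back to free + cached as a rough
approximation:

# Rough sketch: most "used" memory on a broker is reclaimable page cache,
# not memory the Kafka process is holding on to. Linux only; /proc/meminfo
# reports values in kB.

def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])  # value in kB
    return info

m = meminfo()
cached = m.get("Cached", 0) + m.get("Buffers", 0)
# MemAvailable (kernel 3.14+) is the better estimate; otherwise approximate.
available = m.get("MemAvailable", m["MemFree"] + cached)

print("total:     %6d MB" % (m["MemTotal"] // 1024))
print("free:      %6d MB" % (m["MemFree"] // 1024))
print("cache:     %6d MB  (reclaimable)" % (cached // 1024))
print("available: %6d MB" % (available // 1024))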


>
> Is there a way we can restrict this to only x number of messages or x MB
> of total data in memory?
>

This caching works at the operating system level, so Kafka doesn't expose a
way to cap it. You can adjust the retention policies, which would just delete
the data (and by definition that will also take it out of cache), but you
probably don't want to lose that data completely.
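
For reference, retention is the knob you'd use if you really did want less
data kept on disk. These are standard Kafka settings, but the values below
are only illustrative and the topic name is a placeholder; double-check the
property names against the docs for your broker version:

# Broker-wide defaults in server.properties (illustrative values):
log.retention.hours=168
log.retention.bytes=10737418240

# Per-topic override, e.g. keep roughly one day / ~10 GB per partition:
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic \
  --config retention.ms=86400000 --config retention.bytes=10737418240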

Think of it this way: if you applied the type of restriction you're talking
about, what data would you have discarded? Are any of your applications
currently accessing the data that would have been discarded, e.g. because
they are resetting to the beginning of the log and scanning through the
full data set? If the answer is yes, then another way to view the situation
is that it's your applications that are "misbehaving" in the sense that they
exhibit bad data access patterns that aren't actually required, resulting
in accessing more data than necessary, which doesn't fit in memory and
therefore reduces your throughput.

-Ewen


>
> Regards,
> Nilesh Chhapru.
>
> On Tuesday 28 July 2015 12:37 PM, Ewen Cheslack-Postava wrote:
> > Nilesh,
> >
> > It's expected that a lot of memory is used for cache. This makes sense
> > because under the hood, Kafka mostly just reads and writes data to/from
> > files. While Kafka does manage some in-memory data, mostly it is writing
> > produced data (or replicated data) to log files and then serving those same
> > messages to consumers directly out of the log files. It relies on OS-level
> > file system caching to optimize how data is managed. Operating systems are
> > already designed to do this well, so it's generally better to reuse this
> > functionality than to try to implement a custom caching layer.
> >
> > So when you see most of your memory consumed as cache, that's because the
> > OS has used the access patterns for data in those files to select which
> > parts of different files seem most likely to be useful in the future. As
> > Daniel's link points out, it's only doing this when that memory is not
> > needed for some other purpose.
> >
> > This approach isn't always perfect. If you have too much data to fit in
> > memory and you scan through it, performance will suffer. Eventually, you
> > will hit regions of files that are not in cache and the OS will be forced
> > to read those off disk, which is much slower than reading from cache.
> >
> > From your description I'm not sure if you have 120 partitions *per topic*
> > or *total* across all topics. Let's go with the lesser, 120 partitions
> > total. You also mention 3 brokers. Dividing 120 partitions across 3
> > brokers, we get about 40 partitions each broker is a leader for, which is
> > data it definitely needs cached in order to serve consumers. You didn't
> > mention the replication factor, so let's just ignore it here and assume the
> > lowest possible, only 1 copy of the data. Even so, it looks like you have
> > ~8GB of memory (based on the free -m numbers), and at 15 MB/message with 40
> > partitions per broker, that's only 8192/(15*40) = ~14 messages per
> > partition that would fit in memory, assuming it was all used for file
> > cache. That's not much, so if your total data stored is much larger and you
> > ever have to read through any old data, your throughput will likely suffer.
> >
> > It's hard to say much more without understanding what your workload is
> > like, if you're consuming data other than what the Storm spout is
> > consuming, the rate at which you're producing data, etc. However, my
> > initial impression is that you may be trying to process too much data with
> > too little memory and too little disk throughput.
> >
> > If you want more details, I'd suggest reading this section of the docs,
> > which further explains how a lot of this stuff works:
> > http://kafka.apache.org/documentation.html#persistence
> >
> > -Ewen
> >
> > On Mon, Jul 27, 2015 at 11:19 PM, Nilesh Chhapru <
> > nilesh.chhapru@ugamsolutions.com> wrote:
> >
> >> Hi Ewen,
> >>
> >> I am using 3 brokers with 12 topics and roughly 120-125 partitions
> >> without any replication, and each message is approximately 15 MB.
> >>
> >> The problem is that when the cache memory grows and reaches the maximum
> >> available, performance starts degrading; also, I am using a Storm spout
> >> as the consumer, which stops reading at times.
> >>
> >> When I do a free -m on my broker node after half an hour to an hour, the
> >> memory footprint is as follows:
> >> 1) Physical memory - 500 MB - 600 MB
> >> 2) Cache memory - 6.5 GB
> >> 3) Free memory - 50 - 60 MB
> >>
> >> Regards,
> >> Nilesh Chhapru.
> >>
> >> On Monday 27 July 2015 11:02 PM, Ewen Cheslack-Postava wrote:
> >>> Having the OS cache the data in Kafka's log files is useful since it means
> >>> that data doesn't need to be read back from disk when consumed. This is
> >>> good for the latency and throughput of consumers. Usually this caching
> >>> works out pretty well, keeping the latest data from your topics in cache
> >>> and only pulling older data into memory if a consumer reads data from
> >>> earlier in the log. In other words, by leveraging OS-level caching of
> >>> files, Kafka gets an in-memory caching layer for free.
> >>>
> >>> Generally you shouldn't need to clear this data -- the OS should only be
> >>> using memory that isn't being used anyway. Is there a particular problem
> >>> you're encountering that clearing the cache would help with?
> >>>
> >>> -Ewen
> >>>
> >>> On Mon, Jul 27, 2015 at 2:33 AM, Nilesh Chhapru <
> >>> nilesh.chhapru@ugamsolutions.com> wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> I am facing issues with the Kafka broker process taking a lot of cache
> >>>> memory. I just wanted to know if the process really needs that much
> >>>> cache memory, or whether I can clear the OS-level cache with a cron job.
> >>>>
> >>>> Regards,
> >>>> Nilesh Chhapru.
> >>>>
> >>>
> >>
> >
>
>


-- 
Thanks,
Ewen

Re: Cache Memory Kafka Process

Posted by Nilesh Chhapru <ni...@ugamsolutions.com>.
Hi Ewen,

Thanks for the reply.
The assumptions that you made for replication and partitions are
correct: 120 is the total number of partitions, and the replication factor
is 1 for all the topics.

Does that mean that a broker will keep all the messages that are
produced in memory, or only the unconsumed messages?

Is there a way we can restrict this to only x number of messages or x MB
of total data in memory?

Regards,
Nilesh Chhapru.

On Tuesday 28 July 2015 12:37 PM, Ewen Cheslack-Postava wrote:
> Nilesh,
>
> It's expected that a lot of memory is used for cache. This makes sense
> because under the hood, Kafka mostly just reads and writes data to/from
> files. While Kafka does manage some in-memory data, mostly it is writing
> produced data (or replicated data) to log files and then serving those same
> messages to consumers directly out of the log files. It relies on OS-level
> file system caching to optimize how data is managed. Operating systems are
> already designed to do this well, so it's generally better to reuse this
> functionality than to try to implement a custom caching layer.
>
> So when you see most of your memory consumed as cache, that's because the
> OS has used the access patterns for data in those files to select which
> parts of different files seem most likely to be useful in the future. As
> Daniel's link points out, it's only doing this when that memory is not
> needed for some other purpose.
>
> This approach isn't always perfect. If you have too much data to fit in
> memory and you scan through it, performance will suffer. Eventually, you
> will hit regions of files that are not in cache and the OS will be forced
> to read those off disk, which is much slower than reading from cache.
>
> From your description I'm not sure if you have 120 partitions *per topic*
> or *total* across all topics. Let's go with the lesser, 120 partitions
> total. You also mention 3 brokers. Dividing 120 partitions across 3
> brokers, we get about 40 partitions each broker is a leader for, which is
> data it definitely needs cached in order to serve consumers. You didn't
> mention the replication factor, so let's just ignore it here and assume the
> lowest possible, only 1 copy of the data. Even so, it looks like you have
> ~8GB of memory (based on the free -m numbers), and at 15 MB/message with 40
> partitions per broker, that's only 8192/(15*40) = ~14 messages per
> partition that would fit in memory, assuming it was all used for file
> cache. That's not much, so if your total data stored is much larger and you
> ever have to read through any old data, your throughput will likely suffer.
>
> It's hard to say much more without understanding what your workload is
> like, if you're consuming data other than what the Storm spout is
> consuming, the rate at which you're producing data, etc. However, my
> initial impression is that you may be trying to process too much data with
> too little memory and too little disk throughput.
>
> If you want more details, I'd suggest reading this section of the docs,
> which further explains how a lot of this stuff works:
> http://kafka.apache.org/documentation.html#persistence
>
> -Ewen
>
> On Mon, Jul 27, 2015 at 11:19 PM, Nilesh Chhapru <
> nilesh.chhapru@ugamsolutions.com> wrote:
>
>> Hi Ewen,
>>
>> I am using 3 brokers with 12 topics and roughly 120-125 partitions
>> without any replication, and each message is approximately 15 MB.
>>
>> The problem is that when the cache memory grows and reaches the maximum
>> available, performance starts degrading; also, I am using a Storm spout as
>> the consumer, which stops reading at times.
>>
>> When I do a free -m on my broker node after half an hour to an hour, the
>> memory footprint is as follows:
>> 1) Physical memory - 500 MB - 600 MB
>> 2) Cache memory - 6.5 GB
>> 3) Free memory - 50 - 60 MB
>>
>> Regards,
>> Nilesh Chhapru.
>>
>> On Monday 27 July 2015 11:02 PM, Ewen Cheslack-Postava wrote:
>>> Having the OS cache the data in Kafka's log files is useful since it means
>>> that data doesn't need to be read back from disk when consumed. This is
>>> good for the latency and throughput of consumers. Usually this caching
>>> works out pretty well, keeping the latest data from your topics in cache
>>> and only pulling older data into memory if a consumer reads data from
>>> earlier in the log. In other words, by leveraging OS-level caching of
>>> files, Kafka gets an in-memory caching layer for free.
>>>
>>> Generally you shouldn't need to clear this data -- the OS should only be
>>> using memory that isn't being used anyway. Is there a particular problem
>>> you're encountering that clearing the cache would help with?
>>>
>>> -Ewen
>>>
>>> On Mon, Jul 27, 2015 at 2:33 AM, Nilesh Chhapru <
>>> nilesh.chhapru@ugamsolutions.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am facing issues with the Kafka broker process taking a lot of cache
>>>> memory. I just wanted to know if the process really needs that much
>>>> cache memory, or whether I can clear the OS-level cache with a cron job.
>>>>
>>>> Regards,
>>>> Nilesh Chhapru.
>>>>
>>>
>>
>


Re: Cache Memory Kafka Process

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
Nilesh,

It's expected that a lot of memory is used for cache. This makes sense
because under the hood, Kafka mostly just reads and writes data to/from
files. While Kafka does manage some in-memory data, mostly it is writing
produced data (or replicated data) to log files and then serving those same
messages to consumers directly out of the log files. It relies on OS-level
file system caching to optimize how data is managed. Operating systems are
already designed to do this well, so it's generally better to reuse this
functionality than to try to implement a custom caching layer.

So when you see most of your memory consumed as cache, that's because the
OS has used the access patterns for data in those files to select which
parts of different files seem most likely to be useful in the future. As
Daniel's link points out, it's only doing this when that memory is not
needed for some other purpose.

This approach isn't always perfect. If you have too much data to fit in
memory and you scan through it, performance will suffer. Eventually, you
will hit regions of files that are not in cache and the OS will be forced
to read those off disk, which is much slower than reading from cache.

From your description I'm not sure if you have 120 partitions *per topic*
or *total* across all topics. Let's go with the lesser, 120 partitions
total. You also mention 3 brokers. Dividing 120 partitions across 3
brokers, we get about 40 partitions each broker is a leader for, which is
data it definitely needs cached in order to serve consumers. You didn't
mention the replication factor, so let's just ignore it here and assume the
lowest possible, only 1 copy of the data. Even so, it looks like you have
~8GB of memory (based on the free -m numbers), and at 15 MB/message with 40
partitions per broker, that's only 8192/(15*40) = ~14 messages per
partition that would fit in memory, assuming it was all used for file
cache. That's not much, so if your total data stored is much larger and you
ever have to read through any old data, your throughput will likely suffer.
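
Purely as a sanity check on that arithmetic, using the numbers from this
thread (all of these figures are assumptions about your setup):

# Back-of-envelope: how many 15 MB messages per partition fit in ~8 GB of
# page cache, with 120 partitions spread over 3 brokers (40 per broker)?
ram_for_cache_mb = 8 * 1024
message_size_mb = 15
partitions_per_broker = 120 // 3

per_partition = ram_for_cache_mb / float(message_size_mb * partitions_per_broker)
print(round(per_partition, 1))  # ~13.7, i.e. roughly 14 messages per partition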

It's hard to say much more without understanding what your workload is
like, if you're consuming data other than what the Storm spout is
consuming, the rate at which you're producing data, etc. However, my
initial impression is that you may be trying to process too much data with
too little memory and too little disk throughput.

If you want more details, I'd suggest reading this section of the docs,
which further explains how a lot of this stuff works:
http://kafka.apache.org/documentation.html#persistence

-Ewen

On Mon, Jul 27, 2015 at 11:19 PM, Nilesh Chhapru <
nilesh.chhapru@ugamsolutions.com> wrote:

> Hi Ewen,
>
> I am using 3 brokers with 12 topics and roughly 120-125 partitions
> without any replication, and each message is approximately 15 MB.
>
> The problem is that when the cache memory grows and reaches the maximum
> available, performance starts degrading; also, I am using a Storm spout as
> the consumer, which stops reading at times.
>
> When I do a free -m on my broker node after half an hour to an hour, the
> memory footprint is as follows:
> 1) Physical memory - 500 MB - 600 MB
> 2) Cache memory - 6.5 GB
> 3) Free memory - 50 - 60 MB
>
> Regards,
> Nilesh Chhapru.
>
> On Monday 27 July 2015 11:02 PM, Ewen Cheslack-Postava wrote:
> > Having the OS cache the data in Kafka's log files is useful since it means
> > that data doesn't need to be read back from disk when consumed. This is
> > good for the latency and throughput of consumers. Usually this caching
> > works out pretty well, keeping the latest data from your topics in cache
> > and only pulling older data into memory if a consumer reads data from
> > earlier in the log. In other words, by leveraging OS-level caching of
> > files, Kafka gets an in-memory caching layer for free.
> >
> > Generally you shouldn't need to clear this data -- the OS should only be
> > using memory that isn't being used anyway. Is there a particular problem
> > you're encountering that clearing the cache would help with?
> >
> > -Ewen
> >
> > On Mon, Jul 27, 2015 at 2:33 AM, Nilesh Chhapru <
> > nilesh.chhapru@ugamsolutions.com> wrote:
> >
> >> Hi All,
> >>
> >> I am facing issues with the Kafka broker process taking a lot of cache
> >> memory. I just wanted to know if the process really needs that much
> >> cache memory, or whether I can clear the OS-level cache with a cron job.
> >>
> >> Regards,
> >> Nilesh Chhapru.
> >>
> >
> >
>
>


-- 
Thanks,
Ewen

Re: Cache Memory Kafka Process

Posted by Nilesh Chhapru <ni...@ugamsolutions.com>.
Hi Ewen,

I am using 3 brokers with 12 topics and roughly 120-125 partitions
without any replication, and each message is approximately 15 MB.

The problem is that when the cache memory grows and reaches the maximum
available, performance starts degrading; also, I am using a Storm spout as
the consumer, which stops reading at times.

When I do a free -m on my broker node after half an hour to an hour, the
memory footprint is as follows:
1) Physical memory - 500 MB - 600 MB
2) Cache memory - 6.5 GB
3) Free memory - 50 - 60 MB

Regards,
Nilesh Chhapru.

On Monday 27 July 2015 11:02 PM, Ewen Cheslack-Postava wrote:
> Having the OS cache the data in Kafka's log files is useful since it means
> that data doesn't need to be read back from disk when consumed. This is
> good for the latency and throughput of consumers. Usually this caching
> works out pretty well, keeping the latest data from your topics in cache
> and only pulling older data into memory if a consumer reads data from
> earlier in the log. In other words, by leveraging OS-level caching of
> files, Kafka gets an in-memory caching layer for free.
>
> Generally you shouldn't need to clear this data -- the OS should only be
> using memory that isn't being used anyway. Is there a particular problem
> you're encountering that clearing the cache would help with?
>
> -Ewen
>
> On Mon, Jul 27, 2015 at 2:33 AM, Nilesh Chhapru <
> nilesh.chhapru@ugamsolutions.com> wrote:
>
>> Hi All,
>>
>> I am facing issues with the Kafka broker process taking a lot of cache
>> memory. I just wanted to know if the process really needs that much
>> cache memory, or whether I can clear the OS-level cache with a cron job.
>>
>> Regards,
>> Nilesh Chhapru.
>>
>
>


Re: Cache Memory Kafka Process

Posted by Daniel Compton <da...@gmail.com>.
http://www.linuxatemyram.com may be a helpful resource to explain this
better.
On Tue, 28 Jul 2015 at 5:32 AM Ewen Cheslack-Postava <ew...@confluent.io>
wrote:

> Having the OS cache the data in Kafka's log files is useful since it means
> that data doesn't need to be read back from disk when consumed. This is
> good for the latency and throughput of consumers. Usually this caching
> works out pretty well, keeping the latest data from your topics in cache
> and only pulling older data into memory if a consumer reads data from
> earlier in the log. In other words, by leveraging OS-level caching of
> files, Kafka gets an in-memory caching layer for free.
>
> Generally you shouldn't need to clear this data -- the OS should only be
> using memory that isn't being used anyway. Is there a particular problem
> you're encountering that clearing the cache would help with?
>
> -Ewen
>
> On Mon, Jul 27, 2015 at 2:33 AM, Nilesh Chhapru <
> nilesh.chhapru@ugamsolutions.com> wrote:
>
> > Hi All,
> >
> > I am facing issues with the Kafka broker process taking a lot of cache
> > memory. I just wanted to know if the process really needs that much
> > cache memory, or whether I can clear the OS-level cache with a cron job.
> >
> > Regards,
> > Nilesh Chhapru.
> >
>
>
>
> --
> Thanks,
> Ewen
>
--
Daniel

Re: Cache Memory Kafka Process

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
Having the OS cache the data in Kafka's log files is useful since it means
that data doesn't need to be read back from disk when consumed. This is
good for the latency and throughput of consumers. Usually this caching
works out pretty well, keeping the latest data from your topics in cache
and only pulling older data into memory if a consumer reads data from
earlier in the log. In other words, by leveraging OS-level caching of
files, Kafka gets an in-memory caching layer for free.
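
An easy way to see this effect for yourself, completely outside of Kafka:
time two consecutive reads of a large file. The second pass is normally
served from the page cache rather than disk. A rough sketch (the path below
is just a placeholder; point it at any large file):

# Illustrates the OS page cache, not anything Kafka-specific: read the same
# large file twice; the second read is usually much faster because the pages
# are already cached in memory.
import time

PATH = "/tmp/kafka-logs/my-topic-0/00000000000000000000.log"  # placeholder

def read_all(path):
    start = time.time()
    with open(path, "rb") as f:
        while f.read(1024 * 1024):  # 1 MB chunks
            pass
    return time.time() - start

print("first read:  %.2f s (may have to hit disk)" % read_all(PATH))
print("second read: %.2f s (usually from page cache)" % read_all(PATH))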

Generally you shouldn't need to clear this data -- the OS should only be
using memory that isn't being used anyway. Is there a particular problem
you're encountering that clearing the cache would help with?

-Ewen

On Mon, Jul 27, 2015 at 2:33 AM, Nilesh Chhapru <
nilesh.chhapru@ugamsolutions.com> wrote:

> Hi All,
>
> I am facing issues with the Kafka broker process taking a lot of cache
> memory. I just wanted to know if the process really needs that much
> cache memory, or whether I can clear the OS-level cache with a cron job.
>
> Regards,
> Nilesh Chhapru.
>



-- 
Thanks,
Ewen