Posted to users@kafka.apache.org by Andrea Giordano <an...@gmail.com> on 2017/10/25 15:44:31 UTC

"Persistence strategy" clarification

Hi,
I’m looking at the Kafka documentation, in particular the Persistence section:

https://kafka.apache.org/documentation/#persistence

If I understand correctly, it says that Kafka writes data to disk as it arrives instead of keeping it in RAM. That sounds strange to me (aren't disk writes expensive operations?), but of course I trust the Kafka developers. I would like confirmation of this.

Assuming that is the case, and to verify it, I ran a simple task with a data stream of 500kb/s for a few minutes on a machine with 4GB-200GB, and I plotted graphs of RAM usage (%) and disk space usage (MB), which you can find here:


RAM : https://ibb.co/mzYD5m 

DISK SPACE: https://ibb.co/coAMrR 

Ingestion of the stream starts at second 125 and finishes at around second 870.

Based on what I understood, I expected the disk space graph to decrease linearly (as space is gradually occupied while data arrives). Instead, I can't explain the flat regions, which indicate that no additional space is being occupied during those seconds.
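
As a rough sanity check on the total growth the disk graph should show by the end of the run, here is a minimal Python sketch. It assumes that "500kb/s" means 500 kilobytes per second and that 1 MB = 1000 kB; the function name is only illustrative. It ignores compression and any log or index overhead, so the real figure may differ.

```python
def expected_growth_mb(rate_kb_per_s, start_s, end_s):
    """Approximate data written in MB, assuming 1 MB = 1000 kB."""
    return rate_kb_per_s * (end_s - start_s) / 1000


# Stream runs from second 125 to around second 870 at an assumed 500 kB/s.
print(expected_growth_mb(500, 125, 870))
```

Whatever the shape of the curve in between, the total drop in free space over the run should be in this ballpark if the assumptions hold.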

Thank you,
Andrea




Re: "Persistence strategy" clarification

Posted by Andrea Giordano <an...@gmail.com>.
Ok, that makes sense.
So should I conclude that Kafka's write strategy (writing directly to disk, if I understood correctly) is “overridden” by the Linux page-caching strategy?
In any case, this invalidates everything, because it stores data in RAM before using disk space...

> On 25 Oct 2017, at 20:51, Fabio Yamada <fa...@gmail.com> wrote:
> 
> Hi Andrea,
> 
> Although I'm not experienced with Kafka, I found a reference in the docs
> that explains the behavior you saw in your graphs. Here is an excerpt
> from the official documentation:
> 
> Understanding Linux OS Flush Behavior
> <http://kafka.apache.org/documentation/#linuxflush>
> 
> In Linux, data written to the filesystem is maintained in pagecache until
> it must be written out to disk (due to an application-level fsync or the
> OS's own flush policy). The flushing of data is done by a set of background
> threads called pdflush (or in post 2.6.32 kernels "flusher threads").
> Pdflush has a configurable policy that controls how much dirty data can be
> maintained in cache and for how long before it must be written back to
> disk. This policy is described here. When Pdflush cannot keep up with the
> rate of data being written it will eventually cause the writing process to
> block incurring latency in the writes to slow down the accumulation of data.
> 
> Basically, the flat areas you are seeing in the graph are where the OS is
> building up data in the pagecache. The data is then flushed to disk all at
> once, and the cycle repeats.
> 
> Regards,
> Yamada


Re: "Persistence strategy" clarification

Posted by Fabio Yamada <fa...@gmail.com>.
Hi Andrea,

Although I'm not experienced with Kafka, I found a reference in the docs
that explains the behavior you saw in your graphs. Here is an excerpt
from the official documentation:

Understanding Linux OS Flush Behavior
<http://kafka.apache.org/documentation/#linuxflush>

In Linux, data written to the filesystem is maintained in pagecache until
it must be written out to disk (due to an application-level fsync or the
OS's own flush policy). The flushing of data is done by a set of background
threads called pdflush (or in post 2.6.32 kernels "flusher threads").
Pdflush has a configurable policy that controls how much dirty data can be
maintained in cache and for how long before it must be written back to
disk. This policy is described here. When Pdflush cannot keep up with the
rate of data being written it will eventually cause the writing process to
block incurring latency in the writes to slow down the accumulation of data.

Basically, the flat areas you are seeing in the graph are where the OS is
building up data in the pagecache. The data is then flushed to disk all at
once, and the cycle repeats.
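
One way to observe this on a Linux machine is to watch the "Dirty" counter
in /proc/meminfo, which reports memory waiting to be written back to disk.
The following is a minimal Python sketch, not anything taken from Kafka
itself: the helper names are hypothetical, and the before/after numbers are
system-wide, so other processes (or a tmpfs temp directory) can make them
noisy.

```python
import os
import tempfile


def parse_meminfo_kb(text, field):
    """Return the kB value of `field` in /proc/meminfo-style text, or None."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    return None


def dirty_kb(path="/proc/meminfo"):
    """Read the current system-wide amount of dirty page cache, in kB."""
    with open(path) as f:
        return parse_meminfo_kb(f.read(), "Dirty")


def write_then_sync(directory, payload):
    """Write payload to a temp file; return Dirty kB before and after fsync."""
    fd, name = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)   # normally lands in the page cache first
        before = dirty_kb()
        os.fsync(fd)            # ask the OS to flush this file to disk
        after = dirty_kb()
    finally:
        os.close(fd)
        os.remove(name)
    return before, after


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with tempfile.TemporaryDirectory() as d:
        print(write_then_sync(d, b"x" * (8 * 1024 * 1024)))
```

Sampling dirty_kb() once per second next to the disk usage graph should
show the counter rising during the flat regions and dropping when the disk
usage steps, if the pagecache explanation is the right one.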

Regards,
Yamada
