Posted to users@kafka.apache.org by Jay Kreps <ja...@gmail.com> on 2012/05/24 19:40:42 UTC

Solution for blocking fsync in 0.8

One issue with using the filesystem for persistence is that the
synchronization in the filesystem is not great. In particular the fsync and
fdatasync system calls block appends to the file, apparently for the entire
duration of the fsync (which can be quite long). This is documented in some
detail here:
  http://antirez.com/post/fsync-different-thread-useless.html

This is a problem in 0.7 because our definition of a committed message is
one written prior to calling fsync(). This is the only way to guarantee the
message is on disk. We do not hand out any messages to consumers until an
fsync call occurs. The problem is that regardless of whether the fsync is
in a background thread or not it will block any produce requests to the
file. This is buffered a bit in the client since our produce request is
effectively async in 0.7, but it can lead to weird latency spikes
nonetheless as this buffering gets filled.

In 0.8 with replication the definition of a committed message changes to
one that is replicated to multiple machines, not necessarily committed to
disk. This is a different kind of guarantee with different strengths and
weaknesses (pro: data can survive destruction of the file system on one
machine, con: you will lose a few messages if you haven't sync'd and the
power goes out). We will likely retain the flush interval and time settings
for those who want fine grained control over flushing, but it is less
relevant.
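
As a rough illustration of what the flush interval and time settings amount
to, here is a minimal sketch in C. The struct, field, and function names are
hypothetical and are not Kafka's actual configuration or code; the only
assumption is that a log gets flushed once either a pending-message count or
an elapsed-time threshold is reached.

```c
#include <stdbool.h>

/* Hypothetical flush policy; names are illustrative only, not Kafka's. */
struct flush_policy {
    long max_unflushed_messages; /* flush once this many appends are pending */
    long max_unflushed_ms;       /* ...or once this much time has passed     */
};

/* Returns true when either threshold has been reached and there is
 * at least one unflushed message. */
static bool should_flush(const struct flush_policy *p,
                         long unflushed_messages,
                         long ms_since_last_flush) {
    if (unflushed_messages <= 0)
        return false; /* nothing pending, nothing to flush */
    return unflushed_messages >= p->max_unflushed_messages ||
           ms_since_last_flush >= p->max_unflushed_ms;
}
```

With a policy of 500 messages or 1000 ms, a single pending message is flushed
only after the full second has elapsed, while a burst of 500 messages is
flushed immediately.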

Unfortunately *any* call to fsync will block appends even in a background
thread so how can we give control over physical disk persistence without
introducing high latency for the producer? The answer is that the Linux
pdflush daemon actually does a very similar thing to our flush parameters.
pdflush is a daemon running on every Linux machine that controls the
writing of buffered/cached data back to disk. It allows you to control the
percentage of memory filled with dirty pages by giving it either a
percentage of memory, a time out for any dirty page to be written, or a
fixed number of dirty bytes.
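
For reference, these controls are exposed as files under /proc/sys/vm on
Linux: dirty_background_ratio and dirty_ratio (percentage of memory),
dirty_expire_centisecs and dirty_writeback_centisecs (timeouts, in hundredths
of a second), and dirty_background_bytes and dirty_bytes (fixed byte counts).
The sketch below reads and prints them. The helper names are made up for this
example, and which files are present depends on the kernel version.

```c
#include <stdio.h>

/* Read a single integer from a sysctl-style file such as
 * /proc/sys/vm/dirty_expire_centisecs. Returns 0 on success, -1 on error. */
static int read_long_from_file(const char *path, long *out) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* The kernel expresses writeback intervals in centiseconds. */
static long centisecs_to_ms(long centisecs) {
    return centisecs * 10;
}

/* Print the writeback-related settings that are readable on this machine. */
static void print_writeback_settings(void) {
    static const char *names[] = {
        "dirty_background_ratio", "dirty_ratio",
        "dirty_background_bytes", "dirty_bytes",
        "dirty_expire_centisecs", "dirty_writeback_centisecs",
    };
    char path[128];
    long value;

    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        snprintf(path, sizeof(path), "/proc/sys/vm/%s", names[i]);
        if (read_long_from_file(path, &value) == 0)
            printf("%s = %ld\n", names[i], value);
        else
            printf("%s: not readable\n", names[i]);
    }
}
```

Calling print_writeback_settings() from main() lists the current values; a
dirty_expire_centisecs of 500 means a dirty page becomes eligible for
writeback after centisecs_to_ms(500) = 5000 ms.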

The question is, does pdflush block appends? The answer seems to be mostly
no. It locks the page being flushed but not the whole file. The time to
flush one page is actually usually pretty quick (plus I think it may not be
flushing just-written pages anyway).

I wrote some test code for this, modified from the code in the link above.
Here are the results from my desktop (CentOS, Linux 2.6.32).

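We run the test writing 1024 bytes every 100 us and calling fsync every 500 us.
Each bare number is the time in us taken by one write() call; note the ~19 ms
write that follows each sync: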

$ ./pdflush-test 1024 100 500
21
4
3
3
9
6
Sync in 20277 us (0), sleeping for 500 us
19819
7
7
8
38
Sync in 19470 us (0), sleeping for 500 us
19048
7
4
3
8
4
Sync in 19405 us (0), sleeping for 500 us
19017
6
6
10
6
Sync in 19410 us (0), sleeping for 500 us
19025
7
7
11
6

$ cat /proc/sys/vm/dirty_writeback_centisecs
100
$ cat /proc/sys/vm/dirty_expire_centisecs
500

Now run the test with the fsync thread effectively disabled (it sleeps so long
that it rarely runs), leaving the flushing to pdflush:
$ ./pdflush-test 1024 100 5000000000000 > times.txt

I ran this for 298,028 writes. The 99.9th percentile for this test is 17 us
and the max time was 2043 us (2ms).
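
For anyone who wants to summarize their own times.txt, here is one way to
compute such a percentile in C. This is a sketch using the nearest-rank
definition, which is an assumption; the figure above may have been computed
differently. The function names are made up for this example, and the
percentile is passed as a fraction num/den (999/1000 for the 99.9th) to keep
the arithmetic in integers.

```c
#include <stdlib.h>

/* qsort comparator for long long values, ascending. */
static int cmp_ll(const void *a, const void *b) {
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile. Sorts `samples` in place and returns the
 * smallest sample such that at least num/den of all samples are <= it.
 * Requires n > 0 and 0 < num <= den. */
static long long percentile(long long *samples, size_t n,
                            size_t num, size_t den) {
    qsort(samples, n, sizeof(samples[0]), cmp_ll);
    size_t rank = (n * num + den - 1) / den; /* ceil(n * num / den) */
    if (rank < 1)
        rank = 1;
    if (rank > n)
        rank = n;
    return samples[rank - 1];
}
```

After loading the write times into an array, percentile(times, n, 999, 1000)
gives the 99.9th percentile and percentile(times, n, 1, 1) gives the maximum.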

Here is the test code:

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <pthread.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <stdlib.h>

/* Current wall-clock time in microseconds. */
static long long microseconds(void) {
    struct timeval tv;
    long long mst;

    gettimeofday(&tv, NULL);
    mst = ((long long)tv.tv_sec)*1000000;
    mst += tv.tv_usec;
    return mst;
}

/* Background thread: sleep, then fsync the file through a second
 * descriptor and print how long the fsync took. */
void *IOThreadEntryPoint(void *arg) {
    int fd, retval;
    long long start;
    long sleep = (long) arg;

    while(1) {
        usleep(sleep);
        start = microseconds();
        fd = open("/tmp/foo.txt",O_RDONLY);
        if (fd == -1) {
            perror("open");
            exit(1);
        }
        retval = fsync(fd);
        close(fd);
        printf("Sync in %lld us (%d), sleeping for %ld us\n",
               microseconds()-start, retval, sleep);
    }
    return NULL;
}

int main(int argc, char* argv[]) {
    if(argc != 4) {
      printf("USAGE: %s size write_sleep fsync_sleep\n", argv[0]);
      exit(1);
    }

    pthread_t thread;
    int fd = open("/tmp/foo.txt",O_WRONLY|O_CREAT,0644);
    long long start;
    long long elapsed;
    int size = atoi(argv[1]);
    long write_sleep = atol(argv[2]);
    long fsync_sleep = atol(argv[3]);
    char buff[size];

    if (fd == -1) {
        perror("open");
        exit(1);
    }
    /* Write zeros rather than uninitialized stack contents. */
    memset(buff, 0, size);

    pthread_create(&thread,NULL,IOThreadEntryPoint, (void*) fsync_sleep);

    /* Main thread: time each write() and print its latency in us. */
    while(1) {
        start = microseconds();
        if (write(fd,buff,size) == -1) {
            perror("write");
            exit(1);
        }
        elapsed = microseconds()-start;
        printf("%lld\n", elapsed);
        usleep(write_sleep);
    }
    /* Not reached: the loop above only ends when the process is killed. */
    close(fd);
    exit(0);
}
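
One caveat when reusing this test: gettimeofday() reports wall-clock time,
which can jump if the system clock is adjusted mid-run. A possible drop-in
replacement for the timing helper, sketched here on the assumption that POSIX
clock_gettime() with CLOCK_MONOTONIC is available (the function name is made
up for this example):

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Monotonic timestamp in microseconds. Only differences between two
 * readings are meaningful; the absolute value has no fixed epoch. */
static long long monotonic_microseconds(void) {
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}
```

The measurements reported above were taken with the gettimeofday() version,
so this only matters for new runs.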

Cheers,

-Jay

Re: Solution for blocking fsync in 0.8

Posted by Chris Burroughs <ch...@gmail.com>.

+list

Makes sense.  My concern was less per topic and more other things on the
same box (I probably want kafka to sync more often than my webserver,
but less often than a database).

On 2012-06-19 01:06, Jay Kreps wrote:
> Yes, that's right, it is a global setting so you lose the ability to have
> per-topic overrides. I think the idea, though, is with replication the real
> durability guarantee comes from the replication and the syncing is just to
> ensure data makes it to disk reasonably quickly.
> 
> -Jay

Re: Solution for blocking fsync in 0.8

Posted by Chris Burroughs <ch...@gmail.com>.

Thanks Jay.  This is a very helpful investigation!

On 05/24/2012 01:40 PM, Jay Kreps wrote:
> 
> Unfortunately *any* call to fsync will block appends even in a background
> thread so how can we give control over physical disk persistence without
> introducing high latency for the producer? The answer is that the linux
> pdflush daemon actually does a very similar thing to our flush parameters.
> pdflush is a daemon running on every linux machine that controls the
> writing of buffered/cached data back to disk. It allows you to control the
> percentage of memory filled with dirty pages by giving it either a
> percentage of memory, a time out for any dirty page to be written, or a
> fixed number of dirty bytes.


This would, however, by necessity be a global setting, right?  (Assuming
there is no /proc trickery to change per-pid pdflush behaviour)

Re: Solution for blocking fsync in 0.8

Posted by S Ahmed <sa...@gmail.com>.

so 40ms for how many messages and what kind of payload?

And any idea how much data is blocked? (msgs/payload)

Even though 40ms doesn't seem like much, it is def. something that can
creep up in a high load environment, and something you can't really monitor
unless you have some sort of metrics built into the system.

Maybe have this built in: http://metrics.codahale.com/

On Fri, May 25, 2012 at 1:22 PM, Jay Kreps <ja...@gmail.com> wrote:

> It depends a great deal on the hw and the flush interval. I think for our
> older generation hw we saw an avg flush time of 40ms, for newer stuff we
> just got it is much less but I think that might be because the disks
> themselves have some kind of nvram or something.
>
> -Jay
>
> On Fri, May 25, 2012 at 7:09 AM, S Ahmed <sa...@gmail.com> wrote:
>
> > In practice (at LinkedIn), how long do you see the calls blocked for during
> > fsyncs?

Re: Solution for blocking fsync in 0.8

Posted by Jay Kreps <ja...@gmail.com>.

It depends a great deal on the hw and the flush interval. I think for our
older generation hw we saw an avg flush time of 40ms, for newer stuff we
just got it is much less but I think that might be because the disks
themselves have some kind of nvram or something.

-Jay

On Fri, May 25, 2012 at 7:09 AM, S Ahmed <sa...@gmail.com> wrote:

> In practice (at LinkedIn), how long do you see the calls blocked for during
> fsyncs?

Re: Solution for blocking fsync in 0.8

Posted by S Ahmed <sa...@gmail.com>.

In practice (at LinkedIn), how long do you see the calls blocked for during
fsyncs?


Re: Solution for blocking fsync in 0.8

Posted by Neelesh <ne...@gmail.com>.

ext4 numbers on my Ubuntu 11.10 don't look all that good. The write is
blocked significantly:

Write in 41 microseconds
Write in 25 microseconds
Sync in 2803 microseconds (0)
Write in 858 microseconds
Write in 27 microseconds
Write in 21 microseconds
Write in 24 microseconds
Write in 26 microseconds
Write in 26 microseconds
Write in 23 microseconds
Write in 27 microseconds
Write in 20 microseconds
Write in 37 microseconds
Sync in 47403 microseconds (0)
Write in 45376 microseconds
Write in 27 microseconds
Write in 23 microseconds
Write in 45 microseconds
Write in 29 microseconds
Write in 50 microseconds
Write in 29 microseconds
Write in 26 microseconds
Write in 26 microseconds
Write in 28 microseconds
Sync in 36169 microseconds (0)
Write in 32379 microseconds
Write in 27 microseconds
Write in 51 microseconds
Write in 44 microseconds
Write in 26 microseconds
Write in 44 microseconds
Write in 26 microseconds
Write in 28 microseconds
Write in 25 microseconds
Write in 27 microseconds
Sync in 3356 microseconds (0)
Write in 1888 microseconds
Write in 23 microseconds
Write in 45 microseconds
Write in 25 microseconds
Write in 44 microseconds
Write in 50 microseconds
Write in 40 microseconds
Write in 35 microseconds
Write in 104 microseconds

Thanks!
-neelesh


On Thu, May 24, 2012 at 3:01 PM, Neelesh <ne...@gmail.com> wrote:

> one of the comments in the article you mentioned has some numbers that
> essentially say ext4 behaves much better (i haven't tested this yet, I'll
> do it tonight and post the results).
>
> thanks
> -neelesh

Re: Solution for blocking fsync in 0.8

Posted by Neelesh <ne...@gmail.com>.

One of the comments in the article you mentioned has some numbers that
essentially say ext4 behaves much better (I haven't tested this yet; I'll
do it tonight and post the results).

thanks
-neelesh

On Thu, May 24, 2012 at 10:40 AM, Jay Kreps <ja...@gmail.com> wrote:

> One issue with using the filesystem for persistence is that the
> synchronization in the filesystem is not great. In particular the fsync and
> fdatasync system calls block appends to the file, apparently for the entire
> duration of the fsync (which can be quite long). This is documented in some
> detail here:
>  http://antirez.com/post/fsync-different-thread-useless.html
>
> This is a problem in 0.7 because our definition of a committed message is
> one written prior to calling fsync(). This is the only way to guarantee the
> message is on disk. We do not hand out any messages to consumers until an
> fsync call occurs. The problem is that regardless of whether the fsync is
> in a background thread or not it will block any produce requests to the
> file. This is buffered a bit in the client since our produce request is
> effectively async in 0.7, but it can lead to weird latency spikes
> nonetheless as this buffering gets filled.
>
> In 0.8 with replication the definition of a committed message changes to
> one that is replicated to multiple machines, not necessarily committed to
> disk. This is a different kind of guarantee with different strengths and
> weaknesses (pro: data can survive destruction of the file system on one
> machine, con: you will lose a few messages if you haven't sync'd and the
> power goes out). We will likely retain the flush interval and time settings
> for those who want fine-grained control over flushing, but it is less
> relevant.
>
> Unfortunately *any* call to fsync will block appends even in a background
> thread so how can we give control over physical disk persistence without
> introducing high latency for the producer? The answer is that the Linux
> pdflush daemon actually does a very similar thing to our flush parameters.
> pdflush is a daemon running on every Linux machine that controls the
> writing of buffered/cached data back to disk. It allows you to control the
> percentage of memory filled with dirty pages by giving it either a
> percentage of memory, a time out for any dirty page to be written, or a
> fixed number of dirty bytes.
>
> The question is, does pdflush block appends? The answer seems to be mostly
> no. It locks the page being flushed but not the whole file. The time to
> flush one page is actually usually pretty quick (plus I think it may not be
> flushing just written pages anyway). I wrote some test code for this and
> here are the results:
>
> I modified the code from the link above. Here are the results from my
> desktop (CentOS Linux 2.6.32).
>
> We run the test writing 1024 bytes every 100 us and flushing every 500 us:
>
> $ ./pdflush-test 1024 100 500
> 21
> 4
> 3
> 3
> 9
> 6
> Sync in 20277 us (0), sleeping for 500 us
> 19819
> 7
> 7
> 8
> 38
> Sync in 19470 us (0), sleeping for 500 us
> 19048
> 7
> 4
> 3
> 8
> 4
> Sync in 19405 us (0), sleeping for 500 us
> 19017
> 6
> 6
> 10
> 6
> Sync in 19410 us (0), sleeping for 500 us
> 19025
> 7
> 7
> 11
> 6
>
> $ cat /proc/sys/vm/dirty_writeback_centisecs
> 100
> $ cat /proc/sys/vm/dirty_expire_centisecs
> 500
>
> Now run the test with the background flush disabled (rarely running):
> $ ./pdflush-test 1024 100 5000000000000 > times.txt
>
> I ran this for 298,028 writes. The 99.9th percentile for this test is 17 us
> and the max time was 2043 us (2ms).
>
> Here is the test code:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/types.h>
> #include <pthread.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/time.h>
> #include <stdlib.h>
>
> static long long microseconds(void) {
>    struct timeval tv;
>    long long mst;
>
>    gettimeofday(&tv, NULL);
>    mst = ((long long)tv.tv_sec)*1000000;
>    mst += tv.tv_usec;
>    return mst;
> }
>
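> /* Background thread: periodically fsync the file the main thread appends to. */
> void *IOThreadEntryPoint(void *arg) {
>    int fd, retval;
>    long long start;
>    long sleep_us = (long) arg;  /* pause between fsync calls, in microseconds */
>
>    while(1) {
>        usleep(sleep_us);
>        start = microseconds();
>        fd = open("/tmp/foo.txt",O_RDONLY);
>        if (fd == -1) {
>            perror("open");
>            exit(1);
>        }
>        retval = fsync(fd);
>        close(fd);
>        printf("Sync in %lld us (%d), sleeping for %ld us\n",
>               microseconds()-start, retval, sleep_us);
>    }
>    return NULL;
> }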
>
> int main(int argc, char* argv[]) {
>    if(argc != 4) {
>      printf("USAGE: %s size write_sleep fsync_sleep\n", argv[0]);
>      exit(1);
>    }
>
>    pthread_t thread;
>    int fd = open("/tmp/foo.txt",O_WRONLY|O_CREAT,0644);
>    long long start;
>    long long elapsed;
>    int size = atoi(argv[1]);
>    long write_sleep = atol(argv[2]);
>    long fsync_sleep = atol(argv[3]);
>
>    if (fd == -1) {
>        perror("open");
>        exit(1);
>    }
>    if (size <= 0) {
>        fprintf(stderr, "size must be a positive number of bytes\n");
>        exit(1);
>    }
>    char buff[size];
>    memset(buff, 0, size);  /* write defined bytes, not stack garbage */
>
>    pthread_create(&thread,NULL,IOThreadEntryPoint, (void*) fsync_sleep);
>
>    /* Time each write() and print its latency in microseconds. */
>    while(1) {
>        start = microseconds();
>        if (write(fd,buff,size) == -1) {
>            perror("write");
>            exit(1);
>        }
>        elapsed = microseconds()-start;
>        printf("%lld\n", elapsed);
>        usleep(write_sleep);
>    }
>    close(fd);
>    exit(0);
> }
>
> Cheers,
>
> -Jay
>

Re: Solution for blocking fsync in 0.8

Posted by Chris Burroughs <ch...@gmail.com>.

Thanks Jay.  This is a very helpful investigation!

On 05/24/2012 01:40 PM, Jay Kreps wrote:
> 
> Unfortunately *any* call to fsync will block appends even in a background
> thread so how can we give control over physical disk persistence without
> introducing high latency for the producer? The answer is that the Linux
> pdflush daemon actually does a very similar thing to our flush parameters.
> pdflush is a daemon running on every Linux machine that controls the
> writing of buffered/cached data back to disk. It allows you to control the
> percentage of memory filled with dirty pages by giving it either a
> percentage of memory, a time out for any dirty page to be written, or a
> fixed number of dirty bytes.


This would, however, by necessity be a global setting, right?  (Assuming
there is no /proc trickery to change per-pid pdflush behaviour.)
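
For reference, the writeback knobs discussed in this thread live under
/proc/sys/vm, which has no per-process component, so they appear to be
system-wide. Here is a minimal sketch in C for inspecting them. The helper
names read_long_from_file and print_writeback_settings are made up for
illustration, and the sketch assumes a Linux host that exposes the standard
vm.dirty_* files:

```c
#include <stdio.h>

/* Read a single integer from a one-line file such as
 * /proc/sys/vm/dirty_expire_centisecs.
 * Returns 0 on success, -1 if the file is missing or unparsable.
 * (Hypothetical helper, not part of the original test program.) */
static int read_long_from_file(const char *path, long *out) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int parsed = fscanf(f, "%ld", out);
    fclose(f);
    return parsed == 1 ? 0 : -1;
}

/* Print the writeback tunables. The paths contain no pid, so whatever
 * is set here applies to every process on the machine. */
static void print_writeback_settings(void) {
    static const char *paths[] = {
        "/proc/sys/vm/dirty_background_ratio",
        "/proc/sys/vm/dirty_ratio",
        "/proc/sys/vm/dirty_writeback_centisecs",
        "/proc/sys/vm/dirty_expire_centisecs",
    };
    for (size_t i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
        long value;
        if (read_long_from_file(paths[i], &value) == 0)
            printf("%s = %ld\n", paths[i], value);
        else
            printf("%s: not readable\n", paths[i]);
    }
}
```

Calling print_writeback_settings() from a small main() should list the same
values that the cat commands in Jay's mail show. Changing them requires root
and affects all writers on the box, not just the broker.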