Posted to dev@mesos.apache.org by Scott Smith <sc...@gmail.com> on 2012/09/20 20:24:40 UTC

Per task overhead

I am working on a project using Dpark, a port of Spark to Python.  We
are running into issues where the overhead for scheduling a task is
too high.  On the m1.small nodes I'm using, it takes a minimum of
about 7ms to schedule and receive the completion for a scheduled task.
I set up timers around the calls to resourceOffers and statusUpdate,
and I was able to get Dpark down to about 0.7-0.8ms per task (of that,
0.4-0.5ms is spent constructing the protobuf and calling launchTasks,
so the actual Python CPU time is quite low), so it would seem that the
Mesos master and client library are taking the other 6.2ms.  This
seems excessive, and I'm guessing there's some low-hanging fruit to
optimize.
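
For reference, the timers are just wall-clock wrappers around the two
driver callbacks, roughly like this (a sketch; the inner Dpark
scheduler here is a stand-in):

    import time

    class TimedScheduler(object):
        # The driver looks the callbacks up by name, so a plain
        # delegating wrapper around the real Dpark scheduler works.
        def __init__(self, inner):
            self.inner = inner

        def __getattr__(self, name):
            return getattr(self.inner, name)  # pass everything else through

        def resourceOffers(self, driver, offers):
            start = time.time()
            self.inner.resourceOffers(driver, offers)  # builds protobufs, calls launchTasks
            print("resourceOffers: %.3f ms" % ((time.time() - start) * 1e3))

        def statusUpdate(self, driver, update):
            start = time.time()
            self.inner.statusUpdate(driver, update)
            print("statusUpdate: %.3f ms" % ((time.time() - start) * 1e3))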

I captured the output using strace, and found this:

[pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444,
st_size=91, ...}) = 0
[pid 25361] gettimeofday({1348162571, 917171}, NULL) = 0
[pid 25361] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 135
[pid 25361] fcntl(135, F_GETFL)         = 0x2 (flags O_RDWR)
[pid 25361] fcntl(135, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 25361] fcntl(135, F_SETFD, FD_CLOEXEC) = 0
[pid 25361] connect(135, {sa_family=AF_INET, sin_port=htons(55048),
sin_addr=inet_addr("10.248.101.245")}, 16) = -1 EINPROGRESS (Operation
now in progress)
[pid 25361] write(88, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
[pid 25362] <... epoll_wait resumed> {{EPOLLIN, {u32=88,
u64=4294967384}}}, 64, 573) = 1
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409706897}) = 0
[pid 25362] read(88, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 25362] epoll_ctl(87, EPOLL_CTL_ADD, 135, {EPOLLOUT, {u32=135,
u64=519691042951}}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 409917492}) = 0
[pid 25362] epoll_wait(87,  <unfinished ...>
[pid 25361] <... write resumed> )       = 8
[pid 25361] stat("/etc/resolv.conf", {st_mode=S_IFREG|0444,
st_size=91, ...}) = 0
[pid 25361] gettimeofday({1348162571, 918516}, NULL) = 0
[pid 25361] gettimeofday({1348162571, 918635}, NULL) = 0
[pid 25361] gettimeofday({1348162571, 918731}, NULL) = 0
[pid 25362] <... epoll_wait resumed> {{EPOLLOUT, {u32=135,
u64=519691042951}}}, 64, 572) = 1
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 410856918}) = 0
[pid 25362] getsockopt(135, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 25362] epoll_ctl(87, EPOLL_CTL_MOD, 135, {EPOLLOUT, {u32=135,
u64=523986010247}}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411067635}) = 0
[pid 25362] epoll_wait(87, {{EPOLLOUT, {u32=135, u64=523986010247}}},
64, 571) = 1
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {70130, 411206938}) = 0
[pid 25362] sendto(135, "POST /slave(1)/mesos.internal.St"..., 301,
MSG_NOSIGNAL, NULL, 0) = 301
[pid 25362] close(135)                  = 0
[pid 25362] shutdown(135, 0 /* receive */) = -1 EBADF (Bad file descriptor)

From this there are several possible optimizations:
1. Don't context switch so often.  Opening a socket in one thread just
to toss it to another thread is inefficient: the kernel must switch
between threads, plus you need extra system calls on the eventfd to
signal the other thread.
2. Skip DNS resolution.  There's no need to stat /etc/resolv.conf so often.
3. Don't call shutdown after close.  It fails with EBADF at that point,
and you don't really need to call it at all.
4. Do more than one thing with a connection.  As far as I can tell,
this is a communication from the client library to the slave; it
should be possible for those connections to be longer-lived.
5. Don't use gettimeofday so often; it is unreliable since time can
move backwards.  Excessive system calls in general should be avoided,
so the 4 calls to gettimeofday and 5 calls to clock_gettime could be
eliminated in favor of some infrequently updated, coarse value stored
in memory (see the sketch after this list).
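
Here's the kind of thing I mean for #5, sketched in Python for brevity
(libprocess would presumably do this in C++ with an atomic; the 100ms
refresh interval is arbitrary):

    import threading
    import time

    _cached_now = time.time()  # coarse timestamp, stale by at most one interval

    def _refresh(interval=0.1):
        global _cached_now
        while True:
            _cached_now = time.time()  # the only call that hits the kernel
            time.sleep(interval)

    _t = threading.Thread(target=_refresh)
    _t.daemon = True  # don't keep the process alive just for the clock
    _t.start()

    def coarse_time():
        return _cached_now  # no syscall on the hot path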

I'm not sure how much of this will address my performance issues,
but it seemed like a reasonable start.  I'm debating whether to
dive into the code myself, but the few times I took a look I quickly
got lost.  Is there a diagram of how the libraries/files fit together,
or maybe a protocol description (so I could just replace the client
library with a native Python one)?

Or alternatively, are there plans to improve the performance of Mesos?
I know the Spark authors have a mode to bypass Mesos in their next
release; I'm debating whether to follow suit.
-- 
        Scott

Re: Per task overhead

Posted by Benjamin Mahler <bm...@twitter.com>.
And by Tom, I mean Scott :)


Re: Per task overhead

Posted by Benjamin Mahler <bm...@twitter.com>.
Great stuff Tom!

I'll let benh answer your performance questions, but as for the protobuf
issue, I've created a ticket <https://issues.apache.org/jira/browse/MESOS-283>
to track this particular performance improvement. We should definitely be
using PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp, and ideally we should
also be using pre-generated cpp protobuf code.
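
For anyone trying this in the meantime: the variable has to be set
before the first google.protobuf import, or the pure-Python
implementation wins. A minimal sketch (assumes the C++ protobuf
extension is built and the generated mesos_pb2 module is on the path):

    import os

    # Must happen before anything imports google.protobuf.
    os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'cpp'

    import mesos_pb2  # generated messages now bind to the C++ backend

    from google.protobuf.internal import api_implementation
    assert api_implementation.Type() == 'cpp'  # sanity check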

P.S. patches <https://reviews.apache.org/groups/mesos/> are always welcome
and appreciated! :)


Re: Per task overhead

Posted by Scott Smith <sc...@gmail.com>.
btw I tried calling 'link' for each new slave in src/sched/sched.cpp.
It got rid of the excess connections, but only helped ~5-10%.

I then noticed that the proxy scheduler converts to serialized pb and
then back via python in order to hand objects over the wall.  I tried
the CPP protobuf implementation for python, and that helped another
10%.

The resulting trace from perf (Linux system profiler) looks like this:

    33.43%           python  python2.7
    12.21%     mesos-master  libmesos-0.9.0.so
    11.11%     mesos-master  [kernel.kallsyms]
     9.12%           python  [kernel.kallsyms]
     8.37%     mesos-master  libc-2.15.so
     7.11%           python  libc-2.15.so
     5.26%           python  _mesos.so
     4.16%     mesos-master  libstdc++.so.6.0.16
     2.76%           python  libstdc++.so.6.0.16
     2.27%           python  _net_proto2___python.so
     0.92%           python  libpthread-2.15.so
     0.76%     mesos-master  libpthread-2.15.so

I'm surprised how much time is spent in the python process (python2.7
- 33.43%) given what I measured from resourceOffers and statusUpdate.
Native python profilers don't seem to be of any assistance here (I
tried cProfile and yappi).  I ran top and used the H option to view
threads, and verified (with gettid()) that the majority (~80%) of the
time is spent in the scheduler driver thread.  The other ~20% was in a
thread that I couldn't identify, which leads me to believe that's the
libprocess event loop thread.
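
(Getting the kernel tid from Python takes a ctypes detour, something
like this; 186 is SYS_gettid on x86-64:)

    import ctypes

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    SYS_gettid = 186  # syscall number on x86-64

    def gettid():
        # Returns the kernel thread id, matching what top -H displays.
        return libc.syscall(SYS_gettid)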

The amount of time in the kernel is a bit disconcerting; I do wonder
if the Mesos model simply requires too much communication (for a given
task, how many client<->master, client<->slave, and master<->slave
messages are there?).  Direct scheduling would need only two, and
Mesos could then be used for higher-level resource allocation and
failure detection/handling.  Furthermore, direct scheduling would
allow for distributed scheduling.

-- 
        Scott

Re: Per task overhead

Posted by Benjamin Hindman <be...@eecs.berkeley.edu>.
Scott,

This is amazingly great! We are committed to improving the performance of
Mesos (and the underlying communication library, libprocess), and getting
very detailed information like this is incredibly valuable. There are
definitely a lot of low-hanging performance optimizations that we just
haven't had a chance to get to yet.

There is not a great document today that describes (a) the structural
relationships between the different components or (b) the protocol used
between components (but it's pretty simple: HTTP, with bodies that are
just binary protocol buffers). I'll try and take it upon myself to
deliver this document within the month. In the meantime, feel free to
create a JIRA ticket about improving the performance of Mesos so that
we can drill down to some real, actionable code changes that can get
committed.
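
To give a flavor of the wire format in the meantime, sending one
message looks roughly like this (a sketch from memory, not the real
client; the endpoint path comes from the internal protobuf message
names, and the example UPID in the User-Agent header is made up):

    import socket

    def send_message(host, port, path, message):
        # 'message' is any generated protobuf instance; 'path' is a
        # libprocess endpoint like the /slave(1)/mesos.internal.* path
        # visible in Scott's strace.
        body = message.SerializeToString()
        request = ("POST %s HTTP/1.0\r\n"
                   "User-Agent: libprocess/scheduler(1)@10.0.0.1:12345\r\n"
                   "Content-Length: %d\r\n"
                   "\r\n" % (path, len(body)))
        sock = socket.create_connection((host, port))
        sock.sendall(request.encode('ascii') + body)
        sock.close()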

Thanks so much!

Ben.